Warning: Permanently added '2620:52:3:1:dead:beef:cafe:c29a' (ED25519) to the list of known hosts.
cmd: ['copr-distgit-client', 'sources']
cwd: /var/lib/copr-rpmbuild/workspace/workdir-7hta920e/cutlass
rc: 0
stdout:
stderr: INFO: Reading stdout from command: git rev-parse --abbrev-ref HEAD
INFO: Reading stdout from command: git rev-parse HEAD
INFO: Reading sources specification file: sources
Running (timeout=172800): unbuffer mock --spec /var/lib/copr-rpmbuild/workspace/workdir-7hta920e/cutlass/cutlass.spec --sources /var/lib/copr-rpmbuild/workspace/workdir-7hta920e/cutlass --resultdir /var/lib/copr-rpmbuild/results --uniqueext 1713469148.550083 -r /var/lib/copr-rpmbuild/results/configs/child.cfg
INFO: mock.py version 5.5 starting (python version = 3.12.1, NVR = mock-5.5-1.fc39), args: /usr/libexec/mock/mock --spec /var/lib/copr-rpmbuild/workspace/workdir-7hta920e/cutlass/cutlass.spec --sources /var/lib/copr-rpmbuild/workspace/workdir-7hta920e/cutlass --resultdir /var/lib/copr-rpmbuild/results --uniqueext 1713469148.550083 -r /var/lib/copr-rpmbuild/results/configs/child.cfg
Start(bootstrap): init plugins
INFO: tmpfs initialized
INFO: selinux enabled
INFO: chroot_scan: initialized
INFO: compress_logs: initialized
Finish(bootstrap): init plugins
Start: init plugins
INFO: tmpfs initialized
INFO: selinux enabled
INFO: chroot_scan: initialized
INFO: compress_logs: initialized
Finish: init plugins
INFO: Signal handler active
Start: run
INFO: Start(/var/lib/copr-rpmbuild/workspace/workdir-7hta920e/cutlass/cutlass.spec) Config(rhel+epel-8-ppc64le)
Start: clean chroot
Finish: clean chroot
Mock Version: 5.5
INFO: Mock Version: 5.5
Start(bootstrap): chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-ppc64le-bootstrap-1713469148.550083/root.
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled package manager cache
Start(bootstrap): cleaning package manager metadata
Finish(bootstrap): cleaning package manager metadata
INFO: Guessed host environment type: unknown
INFO: Using bootstrap image: registry.access.redhat.com/ubi8/ubi
INFO: Pulling image: registry.access.redhat.com/ubi8/ubi
INFO: Copy content of container registry.access.redhat.com/ubi8/ubi to /var/lib/mock/rhel+epel-8-ppc64le-bootstrap-1713469148.550083/root
INFO: Checking that registry.access.redhat.com/ubi8/ubi image matches host's architecture
INFO: mounting registry.access.redhat.com/ubi8/ubi with podman image mount
INFO: image registry.access.redhat.com/ubi8/ubi as /var/lib/containers/storage/overlay/87bf183c6edc75612a432898737e29e32994c72591d132ee5d9a06e35d99c115/merged
INFO: umounting image registry.access.redhat.com/ubi8/ubi (/var/lib/containers/storage/overlay/87bf183c6edc75612a432898737e29e32994c72591d132ee5d9a06e35d99c115/merged) with podman image umount
INFO: Package manager dnf detected and used (fallback)
INFO: Not updating bootstrap chroot, bootstrap_image_ready=True
Start(bootstrap): creating root cache
Finish(bootstrap): creating root cache
Finish(bootstrap): chroot init
Start: chroot init
INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root.
INFO: calling preinit hooks
INFO: enabled root cache
INFO: enabled package manager cache
Start: cleaning package manager metadata
Finish: cleaning package manager metadata
INFO: enabled HW Info plugin
INFO: Package manager dnf detected and used (direct choice)
INFO: Buildroot is handled by package management downloaded with a bootstrap image:
  rpm-4.14.3-28.el8_9.ppc64le
  python3-dnf-4.7.0-19.el8.noarch
  python3-dnf-plugins-core-4.0.21-23.el8.noarch
  yum-4.7.0-19.el8.noarch
Start: installing minimal buildroot with dnf
No matches found for the following disable plugin patterns: local, spacewalk, versionlock
Updating Subscription Management repositories.
Unable to read consumer identity
This system is not registered with an entitlement server. You can use subscription-manager to register.
Copr repository                                 2.5 MB/s | 620 kB     00:00
Additional repo copr_rezso_CUDA                 607 kB/s |  69 kB     00:00
Additional repo http_developer_download_nvidia_  12 MB/s | 3.3 MB     00:00
Additional repo http_developer_download_nvidia_ 5.9 MB/s | 2.0 MB     00:00
Additional repo http_developer_download_nvidia_ 5.1 MB/s | 1.8 MB     00:00
Red Hat Enterprise Linux - BaseOS                46 MB/s |  57 MB     00:01
Red Hat Enterprise Linux - AppStream             19 MB/s |  52 MB     00:02
Red Hat Enterprise Linux - CodeReady Linux Buil 7.4 MB/s | 7.1 MB     00:00
Extra Packages for Enterprise Linux 8 - ppc64le  22 MB/s |  16 MB     00:00
Modular dependency problems:
 Problem 1: nothing provides requested module(nvidia-driver:latest-dkms:20240416083839)
 Problem 2: nothing provides requested module(nvidia-driver:latest-dkms:20240416084055)
Dependencies resolved.
============================================================================================ Package Arch Version Repository Size ============================================================================================ Installing: bash ppc64le 4.4.20-4.el8_6 rhel-baseos 1.6 M bzip2 ppc64le 1.0.6-26.el8 rhel-baseos 61 k coreutils ppc64le 8.30-15.el8 rhel-baseos 1.3 M cpio ppc64le 2.12-11.el8 rhel-baseos 270 k diffutils ppc64le 3.6-6.el8 rhel-baseos 367 k epel-rpm-macros noarch 8-41 epel 27 k findutils ppc64le 1:4.6.0-21.el8 rhel-baseos 542 k gawk ppc64le 4.2.1-4.el8 rhel-baseos 1.1 M gcc ppc64le 8.5.0-20.el8 rhel-appstream 21 M gcc-c++ ppc64le 8.5.0-20.el8 rhel-appstream 13 M grep ppc64le 3.1-6.el8 rhel-baseos 283 k gzip ppc64le 1.9-13.el8_5 rhel-baseos 170 k info ppc64le 6.5-7.el8 rhel-baseos 216 k make ppc64le 1:4.2.1-11.el8 rhel-baseos 504 k patch ppc64le 2.7.6-11.el8 rhel-baseos 146 k redhat-release ppc64le 8.9-0.1.el8 rhel-baseos 45 k redhat-rpm-config noarch 131-1.el8 rhel-appstream 91 k rpm-build ppc64le 4.14.3-28.el8_9 rhel-appstream 177 k sed ppc64le 4.5-5.el8 rhel-baseos 303 k tar ppc64le 2:1.30-9.el8 rhel-baseos 858 k unzip ppc64le 6.0-46.el8 rhel-baseos 198 k util-linux ppc64le 2.32.1-44.el8_9.1 rhel-baseos 2.6 M which ppc64le 2.21-20.el8 rhel-baseos 51 k xz ppc64le 5.2.4-4.el8_6 rhel-baseos 158 k Installing dependencies: annobin ppc64le 11.13-2.el8 rhel-appstream 974 k ansible-srpm-macros noarch 1-12.el8 epel 21 k audit-libs ppc64le 3.0.7-5.el8 rhel-baseos 136 k basesystem noarch 11-5.el8 rhel-baseos 11 k binutils ppc64le 2.30-123.el8 rhel-baseos 6.5 M brotli ppc64le 1.0.6-3.el8 rhel-baseos 329 k bzip2-libs ppc64le 1.0.6-26.el8 rhel-baseos 53 k ca-certificates noarch 2023.2.60_v7.0.306-80.0.el8_8 rhel-baseos 935 k chkconfig ppc64le 1.19.2-1.el8 rhel-baseos 204 k coreutils-common ppc64le 8.30-15.el8 rhel-baseos 2.0 M cpp ppc64le 8.5.0-20.el8 rhel-appstream 11 M cracklib ppc64le 2.9.6-15.el8 rhel-baseos 95 k cracklib-dicts ppc64le 2.9.6-15.el8 
rhel-baseos 4.0 M crypto-policies noarch 20230731-1.git3177e06.el8 rhel-baseos 64 k curl ppc64le 7.61.1-33.el8_9.5 rhel-baseos 358 k cyrus-sasl-lib ppc64le 2.1.27-6.el8_5 rhel-baseos 135 k dwz ppc64le 0.12-10.el8 rhel-appstream 114 k efi-srpm-macros noarch 3-3.el8 rhel-appstream 22 k elfutils ppc64le 0.189-3.el8 rhel-baseos 570 k elfutils-default-yama-scope noarch 0.189-3.el8 rhel-baseos 52 k elfutils-libelf ppc64le 0.189-3.el8 rhel-baseos 239 k elfutils-libs ppc64le 0.189-3.el8 rhel-baseos 332 k expat ppc64le 2.2.5-11.el8_9.1 rhel-baseos 115 k file ppc64le 5.33-25.el8 rhel-baseos 78 k file-libs ppc64le 5.33-25.el8 rhel-baseos 552 k filesystem ppc64le 3.8-6.el8 rhel-baseos 1.1 M fpc-srpm-macros noarch 1.3-1.el8 epel 8.2 k gc ppc64le 7.6.4-3.el8 rhel-appstream 115 k gcc-plugin-annobin ppc64le 8.5.0-20.el8 rhel-appstream 38 k gdb-headless ppc64le 8.2-20.el8 rhel-appstream 3.5 M gdbm ppc64le 1:1.18-2.el8 rhel-baseos 136 k gdbm-libs ppc64le 1:1.18-2.el8 rhel-baseos 64 k ghc-srpm-macros noarch 1.4.2-7.el8 rhel-appstream 9.4 k glib2 ppc64le 2.56.4-161.el8 rhel-baseos 2.6 M glibc ppc64le 2.28-236.el8_9.12 rhel-baseos 3.4 M glibc-all-langpacks ppc64le 2.28-236.el8_9.12 rhel-baseos 26 M glibc-common ppc64le 2.28-236.el8_9.12 rhel-baseos 1.0 M glibc-devel ppc64le 2.28-236.el8_9.12 rhel-baseos 103 k glibc-gconv-extra ppc64le 2.28-236.el8_9.12 rhel-baseos 1.8 M glibc-headers ppc64le 2.28-236.el8_9.12 rhel-baseos 489 k gmp ppc64le 1:6.1.2-10.el8 rhel-baseos 292 k gnupg2 ppc64le 2.2.20-3.el8_6 rhel-baseos 2.7 M gnutls ppc64le 3.6.16-8.el8_9.3 rhel-baseos 992 k go-srpm-macros noarch 2-17.el8 rhel-appstream 13 k guile ppc64le 5:2.0.14-7.el8 rhel-appstream 3.5 M ima-evm-utils ppc64le 1.3.2-12.el8 rhel-baseos 68 k isl ppc64le 0.16.1-6.el8 rhel-appstream 1.2 M kernel-headers ppc64le 4.18.0-513.24.1.el8_9 rhel-baseos 11 M keyutils-libs ppc64le 1.5.10-9.el8 rhel-baseos 35 k krb5-libs ppc64le 1.18.2-26.el8_9 rhel-baseos 909 k libacl ppc64le 2.2.53-1.el8 rhel-baseos 37 k libarchive 
ppc64le 3.3.3-5.el8 rhel-baseos 412 k libasan ppc64le 8.5.0-20.el8 rhel-baseos 439 k libassuan ppc64le 2.5.1-3.el8 rhel-baseos 86 k libatomic ppc64le 8.5.0-20.el8 rhel-baseos 26 k libatomic_ops ppc64le 7.6.2-3.el8 rhel-appstream 38 k libattr ppc64le 2.4.48-3.el8 rhel-baseos 28 k libbabeltrace ppc64le 1.5.4-4.el8 rhel-baseos 219 k libblkid ppc64le 2.32.1-44.el8_9.1 rhel-baseos 242 k libcap ppc64le 2.48-6.el8_9 rhel-baseos 79 k libcap-ng ppc64le 0.7.11-1.el8 rhel-baseos 35 k libcom_err ppc64le 1.45.6-5.el8 rhel-baseos 50 k libcurl ppc64le 7.61.1-33.el8_9.5 rhel-baseos 321 k libdb ppc64le 5.3.28-42.el8_4 rhel-baseos 788 k libdb-utils ppc64le 5.3.28-42.el8_4 rhel-baseos 159 k libfdisk ppc64le 2.32.1-44.el8_9.1 rhel-baseos 270 k libffi ppc64le 3.1-24.el8 rhel-baseos 39 k libgcc ppc64le 8.5.0-20.el8 rhel-baseos 70 k libgcrypt ppc64le 1.8.5-7.el8_6 rhel-baseos 521 k libgomp ppc64le 8.5.0-20.el8 rhel-baseos 213 k libgpg-error ppc64le 1.31-1.el8 rhel-baseos 250 k libidn2 ppc64le 2.2.0-1.el8 rhel-baseos 96 k libksba ppc64le 1.3.5-9.el8_7 rhel-baseos 147 k libmount ppc64le 2.32.1-44.el8_9.1 rhel-baseos 260 k libmpc ppc64le 1.1.0-9.1.el8 rhel-appstream 65 k libnghttp2 ppc64le 1.33.0-5.el8_9 rhel-baseos 85 k libnsl2 ppc64le 1.2.0-2.20180605git4a062cf.el8 rhel-baseos 63 k libpkgconf ppc64le 1.4.2-1.el8 rhel-baseos 38 k libpsl ppc64le 0.20.2-6.el8 rhel-baseos 63 k libpwquality ppc64le 1.4.4-6.el8 rhel-baseos 109 k librtas ppc64le 2.0.2-1.el8 rhel-baseos 69 k libselinux ppc64le 2.9-8.el8 rhel-baseos 178 k libsemanage ppc64le 2.9-9.el8_6 rhel-baseos 179 k libsepol ppc64le 2.9-3.el8 rhel-baseos 368 k libsigsegv ppc64le 2.11-5.el8 rhel-baseos 31 k libsmartcols ppc64le 2.32.1-44.el8_9.1 rhel-baseos 192 k libssh ppc64le 0.9.6-13.el8_9 rhel-baseos 240 k libssh-config noarch 0.9.6-13.el8_9 rhel-baseos 21 k libstdc++ ppc64le 8.5.0-20.el8 rhel-baseos 498 k libstdc++-devel ppc64le 8.5.0-20.el8 rhel-appstream 2.1 M libtasn1 ppc64le 4.13-4.el8_7 rhel-baseos 82 k libtirpc ppc64le 1.1.4-8.el8 
rhel-baseos 127 k libtool-ltdl ppc64le 2.4.6-25.el8 rhel-baseos 62 k libubsan ppc64le 8.5.0-20.el8 rhel-baseos 165 k libunistring ppc64le 0.9.9-3.el8 rhel-baseos 422 k libusbx ppc64le 1.0.23-4.el8 rhel-baseos 78 k libutempter ppc64le 1.1.6-14.el8 rhel-baseos 32 k libuuid ppc64le 2.32.1-44.el8_9.1 rhel-baseos 101 k libverto ppc64le 0.3.2-2.el8 rhel-baseos 25 k libxcrypt ppc64le 4.1.1-6.el8 rhel-baseos 77 k libxcrypt-devel ppc64le 4.1.1-6.el8 rhel-baseos 25 k libxml2 ppc64le 2.9.7-18.el8_9 rhel-baseos 754 k libzstd ppc64le 1.4.4-1.el8 rhel-baseos 276 k lua-libs ppc64le 5.3.4-12.el8 rhel-baseos 129 k lua-srpm-macros noarch 1-13.el8 epel 9.2 k lz4-libs ppc64le 1.8.3-3.el8_4 rhel-baseos 74 k mpfr ppc64le 3.1.6-1.el8 rhel-baseos 234 k ncurses ppc64le 6.1-10.20180224.el8 rhel-baseos 393 k ncurses-base noarch 6.1-10.20180224.el8 rhel-baseos 81 k ncurses-libs ppc64le 6.1-10.20180224.el8 rhel-baseos 361 k nettle ppc64le 3.4.1-7.el8 rhel-baseos 328 k npth ppc64le 1.5-4.el8 rhel-baseos 26 k ocaml-srpm-macros noarch 5-4.el8 rhel-appstream 9.5 k openblas-srpm-macros noarch 2-2.el8 rhel-appstream 8.0 k openldap ppc64le 2.4.46-18.el8 rhel-baseos 380 k openssl-libs ppc64le 1:1.1.1k-12.el8_9 rhel-baseos 1.5 M p11-kit ppc64le 0.23.22-1.el8 rhel-baseos 325 k p11-kit-trust ppc64le 0.23.22-1.el8 rhel-baseos 148 k pam ppc64le 1.3.1-27.el8 rhel-baseos 792 k pcre ppc64le 8.42-6.el8 rhel-baseos 206 k pcre2 ppc64le 10.32-3.el8_6 rhel-baseos 238 k perl-srpm-macros noarch 1-25.el8 rhel-appstream 11 k pkgconf ppc64le 1.4.2-1.el8 rhel-baseos 39 k pkgconf-m4 noarch 1.4.2-1.el8 rhel-baseos 17 k pkgconf-pkg-config ppc64le 1.4.2-1.el8 rhel-baseos 15 k platform-python ppc64le 3.6.8-56.el8_9.3 rhel-baseos 88 k platform-python-setuptools noarch 39.2.0-7.el8 rhel-baseos 632 k popt ppc64le 1.18-1.el8 rhel-baseos 65 k publicsuffix-list-dafsa noarch 20180723-1.el8 rhel-baseos 56 k python-rpm-macros noarch 3-45.el8 rhel-appstream 16 k python-srpm-macros noarch 3-45.el8 rhel-appstream 16 k python3-libs 
ppc64le 3.6.8-56.el8_9.3 rhel-baseos 8.1 M python3-pip-wheel noarch 9.0.3-23.el8_9.1 rhel-baseos 866 k python3-rpm-macros noarch 3-45.el8 rhel-appstream 15 k python3-setuptools-wheel noarch 39.2.0-7.el8 rhel-baseos 289 k qt5-srpm-macros noarch 5.15.3-1.el8 rhel-appstream 11 k readline ppc64le 7.0-10.el8 rhel-baseos 210 k rpm ppc64le 4.14.3-28.el8_9 rhel-baseos 545 k rpm-build-libs ppc64le 4.14.3-28.el8_9 rhel-baseos 166 k rpm-libs ppc64le 4.14.3-28.el8_9 rhel-baseos 381 k rust-srpm-macros noarch 5-2.el8 rhel-appstream 9.3 k setup noarch 2.12.2-9.el8 rhel-baseos 181 k shadow-utils ppc64le 2:4.6-19.el8 rhel-baseos 1.2 M sqlite-libs ppc64le 3.26.0-19.el8_9 rhel-baseos 626 k systemd-libs ppc64le 239-78.el8 rhel-baseos 1.1 M tpm2-tss ppc64le 2.3.2-5.el8 rhel-baseos 226 k tzdata noarch 2024a-1.el8 rhel-baseos 475 k xz-libs ppc64le 5.2.4-4.el8_6 rhel-baseos 112 k zip ppc64le 3.0-23.el8 rhel-baseos 275 k zlib ppc64le 1.2.11-25.el8 rhel-baseos 113 k zstd ppc64le 1.4.4-1.el8 rhel-appstream 346 k Transaction Summary ============================================================================================ Install 175 Packages Total download size: 167 M Installed size: 876 M Downloading Packages: (1/175): bzip2-libs-1.0.6-26.el8.ppc64le.rpm 213 kB/s | 53 kB 00:00 (2/175): cracklib-2.9.6-15.el8.ppc64le.rpm 346 kB/s | 95 kB 00:00 (3/175): bzip2-1.0.6-26.el8.ppc64le.rpm 202 kB/s | 61 kB 00:00 (4/175): libacl-2.2.53-1.el8.ppc64le.rpm 206 kB/s | 37 kB 00:00 (5/175): grep-3.1-6.el8.ppc64le.rpm 1.2 MB/s | 283 kB 00:00 (6/175): libassuan-2.5.1-3.el8.ppc64le.rpm 682 kB/s | 86 kB 00:00 (7/175): libattr-2.4.48-3.el8.ppc64le.rpm 273 kB/s | 28 kB 00:00 (8/175): cracklib-dicts-2.9.6-15.el8.ppc64le.rp 10 MB/s | 4.0 MB 00:00 (9/175): libgpg-error-1.31-1.el8.ppc64le.rpm 1.9 MB/s | 250 kB 00:00 (10/175): libnsl2-1.2.0-2.20180605git4a062cf.el 489 kB/s | 63 kB 00:00 (11/175): libpkgconf-1.4.2-1.el8.ppc64le.rpm 211 kB/s | 38 kB 00:00 (12/175): libsigsegv-2.11-5.el8.ppc64le.rpm 173 kB/s | 31 kB 
00:00 (13/175): librtas-2.0.2-1.el8.ppc64le.rpm 376 kB/s | 69 kB 00:00 (14/175): libtool-ltdl-2.4.6-25.el8.ppc64le.rpm 359 kB/s | 62 kB 00:00 (15/175): libunistring-0.9.9-3.el8.ppc64le.rpm 3.4 MB/s | 422 kB 00:00 (16/175): libutempter-1.1.6-14.el8.ppc64le.rpm 240 kB/s | 32 kB 00:00 (17/175): mpfr-3.1.6-1.el8.ppc64le.rpm 2.1 MB/s | 234 kB 00:00 (18/175): npth-1.5-4.el8.ppc64le.rpm 251 kB/s | 26 kB 00:00 (19/175): pkgconf-1.4.2-1.el8.ppc64le.rpm 411 kB/s | 39 kB 00:00 (20/175): pkgconf-pkg-config-1.4.2-1.el8.ppc64l 155 kB/s | 15 kB 00:00 (21/175): readline-7.0-10.el8.ppc64le.rpm 1.9 MB/s | 210 kB 00:00 (22/175): zip-3.0-23.el8.ppc64le.rpm 2.2 MB/s | 275 kB 00:00 (23/175): basesystem-11-5.el8.noarch.rpm 97 kB/s | 11 kB 00:00 (24/175): pkgconf-m4-1.4.2-1.el8.noarch.rpm 168 kB/s | 17 kB 00:00 (25/175): publicsuffix-list-dafsa-20180723-1.el 556 kB/s | 56 kB 00:00 (26/175): gmp-6.1.2-10.el8.ppc64le.rpm 2.7 MB/s | 292 kB 00:00 (27/175): libidn2-2.2.0-1.el8.ppc64le.rpm 815 kB/s | 96 kB 00:00 (28/175): diffutils-3.6-6.el8.ppc64le.rpm 2.8 MB/s | 367 kB 00:00 (29/175): patch-2.7.6-11.el8.ppc64le.rpm 1.2 MB/s | 146 kB 00:00 (30/175): libusbx-1.0.23-4.el8.ppc64le.rpm 745 kB/s | 78 kB 00:00 (31/175): libpsl-0.20.2-6.el8.ppc64le.rpm 469 kB/s | 63 kB 00:00 (32/175): libzstd-1.4.4-1.el8.ppc64le.rpm 2.3 MB/s | 276 kB 00:00 (33/175): ima-evm-utils-1.3.2-12.el8.ppc64le.rp 667 kB/s | 68 kB 00:00 (34/175): brotli-1.0.6-3.el8.ppc64le.rpm 2.7 MB/s | 329 kB 00:00 (35/175): p11-kit-0.23.22-1.el8.ppc64le.rpm 1.7 MB/s | 325 kB 00:00 (36/175): popt-1.18-1.el8.ppc64le.rpm 472 kB/s | 65 kB 00:00 (37/175): p11-kit-trust-0.23.22-1.el8.ppc64le.r 906 kB/s | 148 kB 00:00 (38/175): libdb-utils-5.3.28-42.el8_4.ppc64le.r 1.4 MB/s | 159 kB 00:00 (39/175): libdb-5.3.28-42.el8_4.ppc64le.rpm 5.5 MB/s | 788 kB 00:00 (40/175): libxcrypt-devel-4.1.1-6.el8.ppc64le.r 191 kB/s | 25 kB 00:00 (41/175): lua-libs-5.3.4-12.el8.ppc64le.rpm 1.0 MB/s | 129 kB 00:00 (42/175): lz4-libs-1.8.3-3.el8_4.ppc64le.rpm 633 kB/s | 
74 kB 00:00 (43/175): openldap-2.4.46-18.el8.ppc64le.rpm 3.1 MB/s | 380 kB 00:00 (44/175): pcre-8.42-6.el8.ppc64le.rpm 1.9 MB/s | 206 kB 00:00 (45/175): cyrus-sasl-lib-2.1.27-6.el8_5.ppc64le 1.2 MB/s | 135 kB 00:00 (46/175): filesystem-3.8-6.el8.ppc64le.rpm 6.6 MB/s | 1.1 MB 00:00 (47/175): libcap-ng-0.7.11-1.el8.ppc64le.rpm 314 kB/s | 35 kB 00:00 (48/175): keyutils-libs-1.5.10-9.el8.ppc64le.rp 296 kB/s | 35 kB 00:00 (49/175): libsepol-2.9-3.el8.ppc64le.rpm 2.6 MB/s | 368 kB 00:00 (50/175): libxcrypt-4.1.1-6.el8.ppc64le.rpm 661 kB/s | 77 kB 00:00 (51/175): nettle-3.4.1-7.el8.ppc64le.rpm 2.6 MB/s | 328 kB 00:00 (52/175): cpio-2.12-11.el8.ppc64le.rpm 1.3 MB/s | 270 kB 00:00 (53/175): gzip-1.9-13.el8_5.ppc64le.rpm 913 kB/s | 170 kB 00:00 (54/175): gawk-4.2.1-4.el8.ppc64le.rpm 4.7 MB/s | 1.1 MB 00:00 (55/175): info-6.5-7.el8.ppc64le.rpm 2.1 MB/s | 216 kB 00:00 (56/175): make-4.2.1-11.el8.ppc64le.rpm 4.0 MB/s | 504 kB 00:00 (57/175): sed-4.5-5.el8.ppc64le.rpm 2.5 MB/s | 303 kB 00:00 (58/175): unzip-6.0-46.el8.ppc64le.rpm 1.9 MB/s | 198 kB 00:00 (59/175): xz-5.2.4-4.el8_6.ppc64le.rpm 1.6 MB/s | 158 kB 00:00 (60/175): xz-libs-5.2.4-4.el8_6.ppc64le.rpm 1.0 MB/s | 112 kB 00:00 (61/175): gdbm-1.18-2.el8.ppc64le.rpm 1.3 MB/s | 136 kB 00:00 (62/175): bash-4.4.20-4.el8_6.ppc64le.rpm 9.5 MB/s | 1.6 MB 00:00 (63/175): gdbm-libs-1.18-2.el8.ppc64le.rpm 524 kB/s | 64 kB 00:00 (64/175): libbabeltrace-1.5.4-4.el8.ppc64le.rpm 1.3 MB/s | 219 kB 00:00 (65/175): libcom_err-1.45.6-5.el8.ppc64le.rpm 335 kB/s | 50 kB 00:00 (66/175): gnupg2-2.2.20-3.el8_6.ppc64le.rpm 8.0 MB/s | 2.7 MB 00:00 (67/175): libsemanage-2.9-9.el8_6.ppc64le.rpm 1.2 MB/s | 179 kB 00:00 (68/175): libgcrypt-1.8.5-7.el8_6.ppc64le.rpm 3.1 MB/s | 521 kB 00:00 (69/175): libtirpc-1.1.4-8.el8.ppc64le.rpm 1.2 MB/s | 127 kB 00:00 (70/175): libverto-0.3.2-2.el8.ppc64le.rpm 223 kB/s | 25 kB 00:00 (71/175): pcre2-10.32-3.el8_6.ppc64le.rpm 2.0 MB/s | 238 kB 00:00 (72/175): libarchive-3.3.3-5.el8.ppc64le.rpm 3.2 MB/s | 412 kB 00:00 
(73/175): coreutils-common-8.30-15.el8.ppc64le. 9.3 MB/s | 2.0 MB 00:00 (74/175): glib2-2.56.4-161.el8.ppc64le.rpm 11 MB/s | 2.6 MB 00:00 (75/175): libffi-3.1-24.el8.ppc64le.rpm 350 kB/s | 39 kB 00:00 (76/175): libksba-1.3.5-9.el8_7.ppc64le.rpm 1.3 MB/s | 147 kB 00:00 (77/175): libselinux-2.9-8.el8.ppc64le.rpm 1.6 MB/s | 178 kB 00:00 (78/175): libtasn1-4.13-4.el8_7.ppc64le.rpm 695 kB/s | 82 kB 00:00 (79/175): platform-python-setuptools-39.2.0-7.e 4.3 MB/s | 632 kB 00:00 (80/175): setup-2.12.2-9.el8.noarch.rpm 1.7 MB/s | 181 kB 00:00 (81/175): tar-1.30-9.el8.ppc64le.rpm 5.4 MB/s | 858 kB 00:00 (82/175): ca-certificates-2023.2.60_v7.0.306-80 5.2 MB/s | 935 kB 00:00 (83/175): crypto-policies-20230731-1.git3177e06 713 kB/s | 64 kB 00:00 (84/175): coreutils-8.30-15.el8.ppc64le.rpm 7.8 MB/s | 1.3 MB 00:00 (85/175): elfutils-0.189-3.el8.ppc64le.rpm 5.6 MB/s | 570 kB 00:00 (86/175): libatomic-8.5.0-20.el8.ppc64le.rpm 343 kB/s | 26 kB 00:00 (87/175): file-5.33-25.el8.ppc64le.rpm 625 kB/s | 78 kB 00:00 (88/175): libgcc-8.5.0-20.el8.ppc64le.rpm 536 kB/s | 70 kB 00:00 (89/175): pam-1.3.1-27.el8.ppc64le.rpm 4.0 MB/s | 792 kB 00:00 (90/175): libpwquality-1.4.4-6.el8.ppc64le.rpm 484 kB/s | 109 kB 00:00 (91/175): python3-setuptools-wheel-39.2.0-7.el8 2.5 MB/s | 289 kB 00:00 (92/175): which-2.21-20.el8.ppc64le.rpm 637 kB/s | 51 kB 00:00 (93/175): zlib-1.2.11-25.el8.ppc64le.rpm 1.2 MB/s | 113 kB 00:00 (94/175): audit-libs-3.0.7-5.el8.ppc64le.rpm 1.4 MB/s | 136 kB 00:00 (95/175): elfutils-default-yama-scope-0.189-3.e 641 kB/s | 52 kB 00:00 (96/175): chkconfig-1.19.2-1.el8.ppc64le.rpm 1.6 MB/s | 204 kB 00:00 (97/175): elfutils-libelf-0.189-3.el8.ppc64le.r 1.5 MB/s | 239 kB 00:00 (98/175): elfutils-libs-0.189-3.el8.ppc64le.rpm 2.2 MB/s | 332 kB 00:00 (99/175): binutils-2.30-123.el8.ppc64le.rpm 15 MB/s | 6.5 MB 00:00 (100/175): file-libs-5.33-25.el8.ppc64le.rpm 3.5 MB/s | 552 kB 00:00 (101/175): findutils-4.6.0-21.el8.ppc64le.rpm 3.2 MB/s | 542 kB 00:00 (102/175): 
libasan-8.5.0-20.el8.ppc64le.rpm 3.9 MB/s | 439 kB 00:00 (103/175): krb5-libs-1.18.2-26.el8_9.ppc64le.rp 6.9 MB/s | 909 kB 00:00 (104/175): libcap-2.48-6.el8_9.ppc64le.rpm 693 kB/s | 79 kB 00:00 (105/175): libnghttp2-1.33.0-5.el8_9.ppc64le.rp 768 kB/s | 85 kB 00:00 (106/175): libgomp-8.5.0-20.el8.ppc64le.rpm 1.6 MB/s | 213 kB 00:00 (107/175): libstdc++-8.5.0-20.el8.ppc64le.rpm 3.3 MB/s | 498 kB 00:00 (108/175): libxml2-2.9.7-18.el8_9.ppc64le.rpm 5.2 MB/s | 754 kB 00:00 (109/175): libubsan-8.5.0-20.el8.ppc64le.rpm 1.0 MB/s | 165 kB 00:00 (110/175): ncurses-6.1-10.20180224.el8.ppc64le. 2.3 MB/s | 393 kB 00:00 (111/175): ncurses-base-6.1-10.20180224.el8.noa 841 kB/s | 81 kB 00:00 (112/175): ncurses-libs-6.1-10.20180224.el8.ppc 2.5 MB/s | 361 kB 00:00 (113/175): platform-python-3.6.8-56.el8_9.3.ppc 995 kB/s | 88 kB 00:00 (114/175): openssl-libs-1.1.1k-12.el8_9.ppc64le 9.7 MB/s | 1.5 MB 00:00 (115/175): redhat-release-8.9-0.1.el8.ppc64le.r 522 kB/s | 45 kB 00:00 (116/175): shadow-utils-4.6-19.el8.ppc64le.rpm 8.8 MB/s | 1.2 MB 00:00 (117/175): sqlite-libs-3.26.0-19.el8_9.ppc64le. 3.9 MB/s | 626 kB 00:00 (118/175): python3-libs-3.6.8-56.el8_9.3.ppc64l 18 MB/s | 8.1 MB 00:00 (119/175): systemd-libs-239-78.el8.ppc64le.rpm 4.6 MB/s | 1.1 MB 00:00 (120/175): tpm2-tss-2.3.2-5.el8.ppc64le.rpm 1.2 MB/s | 226 kB 00:00 (121/175): libssh-0.9.6-13.el8_9.ppc64le.rpm 2.5 MB/s | 240 kB 00:00 (122/175): rpm-4.14.3-28.el8_9.ppc64le.rpm 5.1 MB/s | 545 kB 00:00 (123/175): libssh-config-0.9.6-13.el8_9.noarch. 
172 kB/s | 21 kB 00:00 (124/175): rpm-build-libs-4.14.3-28.el8_9.ppc64 1.9 MB/s | 166 kB 00:00 (125/175): rpm-libs-4.14.3-28.el8_9.ppc64le.rpm 3.6 MB/s | 381 kB 00:00 (126/175): tzdata-2024a-1.el8.noarch.rpm 4.0 MB/s | 475 kB 00:00 (127/175): glibc-common-2.28-236.el8_9.12.ppc64 8.7 MB/s | 1.0 MB 00:00 (128/175): glibc-2.28-236.el8_9.12.ppc64le.rpm 16 MB/s | 3.4 MB 00:00 (129/175): glibc-devel-2.28-236.el8_9.12.ppc64l 1.3 MB/s | 103 kB 00:00 (130/175): glibc-headers-2.28-236.el8_9.12.ppc6 4.7 MB/s | 489 kB 00:00 (131/175): glibc-gconv-extra-2.28-236.el8_9.12. 10 MB/s | 1.8 MB 00:00 (132/175): curl-7.61.1-33.el8_9.5.ppc64le.rpm 3.2 MB/s | 358 kB 00:00 (133/175): libblkid-2.32.1-44.el8_9.1.ppc64le.r 1.9 MB/s | 242 kB 00:00 (134/175): libcurl-7.61.1-33.el8_9.5.ppc64le.rp 3.3 MB/s | 321 kB 00:00 (135/175): libfdisk-2.32.1-44.el8_9.1.ppc64le.r 3.2 MB/s | 270 kB 00:00 (136/175): libmount-2.32.1-44.el8_9.1.ppc64le.r 3.2 MB/s | 260 kB 00:00 (137/175): kernel-headers-4.18.0-513.24.1.el8_9 20 MB/s | 11 MB 00:00 (138/175): libsmartcols-2.32.1-44.el8_9.1.ppc64 1.7 MB/s | 192 kB 00:00 (139/175): libuuid-2.32.1-44.el8_9.1.ppc64le.rp 967 kB/s | 101 kB 00:00 (140/175): python3-pip-wheel-9.0.3-23.el8_9.1.n 6.4 MB/s | 866 kB 00:00 (141/175): expat-2.2.5-11.el8_9.1.ppc64le.rpm 1.4 MB/s | 115 kB 00:00 (142/175): util-linux-2.32.1-44.el8_9.1.ppc64le 13 MB/s | 2.6 MB 00:00 (143/175): gnutls-3.6.16-8.el8_9.3.ppc64le.rpm 6.8 MB/s | 992 kB 00:00 (144/175): gc-7.6.4-3.el8.ppc64le.rpm 622 kB/s | 115 kB 00:00 (145/175): libatomic_ops-7.6.2-3.el8.ppc64le.rp 247 kB/s | 38 kB 00:00 (146/175): glibc-all-langpacks-2.28-236.el8_9.1 16 MB/s | 26 MB 00:01 (147/175): isl-0.16.1-6.el8.ppc64le.rpm 6.2 MB/s | 1.2 MB 00:00 (148/175): ghc-srpm-macros-1.4.2-7.el8.noarch.r 99 kB/s | 9.4 kB 00:00 (149/175): ocaml-srpm-macros-5-4.el8.noarch.rpm 142 kB/s | 9.5 kB 00:00 (150/175): openblas-srpm-macros-2-2.el8.noarch. 
118 kB/s | 8.0 kB 00:00 (151/175): guile-2.0.14-7.el8.ppc64le.rpm 11 MB/s | 3.5 MB 00:00 (152/175): perl-srpm-macros-1-25.el8.noarch.rpm 83 kB/s | 11 kB 00:00 (153/175): rust-srpm-macros-5-2.el8.noarch.rpm 69 kB/s | 9.3 kB 00:00 (154/175): zstd-1.4.4-1.el8.ppc64le.rpm 3.5 MB/s | 346 kB 00:00 (155/175): efi-srpm-macros-3-3.el8.noarch.rpm 250 kB/s | 22 kB 00:00 (156/175): libmpc-1.1.0-9.1.el8.ppc64le.rpm 453 kB/s | 65 kB 00:00 (157/175): go-srpm-macros-2-17.el8.noarch.rpm 84 kB/s | 13 kB 00:00 (158/175): dwz-0.12-10.el8.ppc64le.rpm 871 kB/s | 114 kB 00:00 (159/175): python-rpm-macros-3-45.el8.noarch.rp 222 kB/s | 16 kB 00:00 (160/175): python3-rpm-macros-3-45.el8.noarch.r 183 kB/s | 15 kB 00:00 (161/175): qt5-srpm-macros-5.15.3-1.el8.noarch. 116 kB/s | 11 kB 00:00 (162/175): python-srpm-macros-3-45.el8.noarch.r 197 kB/s | 16 kB 00:00 (163/175): redhat-rpm-config-131-1.el8.noarch.r 882 kB/s | 91 kB 00:00 (164/175): annobin-11.13-2.el8.ppc64le.rpm 5.5 MB/s | 974 kB 00:00 (165/175): gcc-plugin-annobin-8.5.0-20.el8.ppc6 378 kB/s | 38 kB 00:00 (166/175): gdb-headless-8.2-20.el8.ppc64le.rpm 12 MB/s | 3.5 MB 00:00 (167/175): gcc-c++-8.5.0-20.el8.ppc64le.rpm 18 MB/s | 13 MB 00:00 (168/175): libstdc++-devel-8.5.0-20.el8.ppc64le 8.2 MB/s | 2.1 MB 00:00 (169/175): rpm-build-4.14.3-28.el8_9.ppc64le.rp 1.8 MB/s | 177 kB 00:00 (170/175): gcc-8.5.0-20.el8.ppc64le.rpm 20 MB/s | 21 MB 00:01 (171/175): ansible-srpm-macros-1-12.el8.noarch. 100 kB/s | 21 kB 00:00 (172/175): fpc-srpm-macros-1.3-1.el8.noarch.rpm 178 kB/s | 8.2 kB 00:00 (173/175): epel-rpm-macros-8-41.noarch.rpm 103 kB/s | 27 kB 00:00 (174/175): lua-srpm-macros-1-13.el8.noarch.rpm 33 kB/s | 9.2 kB 00:00 (175/175): cpp-8.5.0-20.el8.ppc64le.rpm 15 MB/s | 11 MB 00:00 -------------------------------------------------------------------------------- Total 16 MB/s | 167 MB 00:10 Red Hat Enterprise Linux - BaseOS 3.1 MB/s | 3.1 kB 00:00 Importing GPG key 0xFD431D51: Userid : "Red Hat, Inc. 
(release key 2) " Fingerprint: 567E 347A D004 4ADE 55BA 8A5F 199E 2F91 FD43 1D51 From : /usr/share/distribution-gpg-keys/redhat/RPM-GPG-KEY-redhat8-release Key imported successfully Importing GPG key 0x2FA658E0: Userid : "Red Hat, Inc. (auxiliary key) " Fingerprint: 43A6 E49C 4A38 F4BE 9ABF 2A53 4568 9C88 2FA6 58E0 From : /usr/share/distribution-gpg-keys/redhat/RPM-GPG-KEY-redhat8-release Key imported successfully Extra Packages for Enterprise Linux 8 - ppc64le 1.6 MB/s | 1.6 kB 00:00 Importing GPG key 0x2F86D6A1: Userid : "Fedora EPEL (8) " Fingerprint: 94E2 79EB 8D8F 25B2 1810 ADF1 21EA 45AB 2F86 D6A1 From : /usr/share/distribution-gpg-keys/epel/RPM-GPG-KEY-EPEL-8 Key imported successfully Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. Running transaction Running scriptlet: filesystem-3.8-6.el8.ppc64le 1/1 Preparing : 1/1 Installing : libgcc-8.5.0-20.el8.ppc64le 1/175 Running scriptlet: libgcc-8.5.0-20.el8.ppc64le 1/175 Installing : python-srpm-macros-3-45.el8.noarch 2/175 Installing : crypto-policies-20230731-1.git3177e06.el8.noarch 3/175 Running scriptlet: crypto-policies-20230731-1.git3177e06.el8.noarch 3/175 Installing : python-rpm-macros-3-45.el8.noarch 4/175 Installing : python3-pip-wheel-9.0.3-23.el8_9.1.noarch 5/175 Installing : redhat-release-8.9-0.1.el8.ppc64le 6/175 Installing : setup-2.12.2-9.el8.noarch 7/175 warning: /etc/hosts created as /etc/hosts.rpmnew Running scriptlet: setup-2.12.2-9.el8.noarch 7/175 Installing : filesystem-3.8-6.el8.ppc64le 8/175 Installing : python3-setuptools-wheel-39.2.0-7.el8.noarch 9/175 Installing : basesystem-11-5.el8.noarch 10/175 Installing : python3-rpm-macros-3-45.el8.noarch 11/175 Installing : fpc-srpm-macros-1.3-1.el8.noarch 12/175 Installing : ansible-srpm-macros-1-12.el8.noarch 13/175 Installing : qt5-srpm-macros-5.15.3-1.el8.noarch 14/175 Installing : go-srpm-macros-2-17.el8.noarch 15/175 Installing : rust-srpm-macros-5-2.el8.noarch 16/175 Installing 
: perl-srpm-macros-1-25.el8.noarch 17/175 Installing : openblas-srpm-macros-2-2.el8.noarch 18/175 Installing : ocaml-srpm-macros-5-4.el8.noarch 19/175 Installing : ghc-srpm-macros-1.4.2-7.el8.noarch 20/175 Installing : kernel-headers-4.18.0-513.24.1.el8_9.ppc64le 21/175 Installing : tzdata-2024a-1.el8.noarch 22/175 Installing : libssh-config-0.9.6-13.el8_9.noarch 23/175 Installing : ncurses-base-6.1-10.20180224.el8.noarch 24/175 Installing : pcre2-10.32-3.el8_6.ppc64le 25/175 Installing : libselinux-2.9-8.el8.ppc64le 26/175 Installing : ncurses-libs-6.1-10.20180224.el8.ppc64le 27/175 Installing : glibc-all-langpacks-2.28-236.el8_9.12.ppc64le 28/175 Installing : glibc-common-2.28-236.el8_9.12.ppc64le 29/175 Installing : glibc-gconv-extra-2.28-236.el8_9.12.ppc64le 30/175 Running scriptlet: glibc-gconv-extra-2.28-236.el8_9.12.ppc64le 30/175 Running scriptlet: glibc-2.28-236.el8_9.12.ppc64le 31/175 Installing : glibc-2.28-236.el8_9.12.ppc64le 31/175 Running scriptlet: glibc-2.28-236.el8_9.12.ppc64le 31/175 Installing : bash-4.4.20-4.el8_6.ppc64le 32/175 Running scriptlet: bash-4.4.20-4.el8_6.ppc64le 32/175 Installing : libsepol-2.9-3.el8.ppc64le 33/175 Running scriptlet: libsepol-2.9-3.el8.ppc64le 33/175 Installing : zlib-1.2.11-25.el8.ppc64le 34/175 Installing : info-6.5-7.el8.ppc64le 35/175 Installing : bzip2-libs-1.0.6-26.el8.ppc64le 36/175 Installing : xz-libs-5.2.4-4.el8_6.ppc64le 37/175 Installing : gmp-1:6.1.2-10.el8.ppc64le 38/175 Running scriptlet: gmp-1:6.1.2-10.el8.ppc64le 38/175 Installing : libstdc++-8.5.0-20.el8.ppc64le 39/175 Running scriptlet: libstdc++-8.5.0-20.el8.ppc64le 39/175 Installing : libzstd-1.4.4-1.el8.ppc64le 40/175 Installing : elfutils-libelf-0.189-3.el8.ppc64le 41/175 Installing : libxcrypt-4.1.1-6.el8.ppc64le 42/175 Installing : mpfr-3.1.6-1.el8.ppc64le 43/175 Running scriptlet: mpfr-3.1.6-1.el8.ppc64le 43/175 Installing : readline-7.0-10.el8.ppc64le 44/175 Running scriptlet: readline-7.0-10.el8.ppc64le 44/175 Installing : 
sqlite-libs-3.26.0-19.el8_9.ppc64le 45/175 Installing : popt-1.18-1.el8.ppc64le 46/175 Installing : libcap-2.48-6.el8_9.ppc64le 47/175 Installing : libcom_err-1.45.6-5.el8.ppc64le 48/175 Running scriptlet: libcom_err-1.45.6-5.el8.ppc64le 48/175 Installing : libuuid-2.32.1-44.el8_9.1.ppc64le 49/175 Running scriptlet: libuuid-2.32.1-44.el8_9.1.ppc64le 49/175 Installing : chkconfig-1.19.2-1.el8.ppc64le 50/175 Installing : libunistring-0.9.9-3.el8.ppc64le 51/175 Installing : libattr-2.4.48-3.el8.ppc64le 52/175 Installing : libacl-2.2.53-1.el8.ppc64le 53/175 Installing : sed-4.5-5.el8.ppc64le 54/175 Running scriptlet: sed-4.5-5.el8.ppc64le 54/175 Installing : libgpg-error-1.31-1.el8.ppc64le 55/175 Installing : lua-libs-5.3.4-12.el8.ppc64le 56/175 Installing : libffi-3.1-24.el8.ppc64le 57/175 Installing : p11-kit-0.23.22-1.el8.ppc64le 58/175 Installing : libidn2-2.2.0-1.el8.ppc64le 59/175 Installing : libmpc-1.1.0-9.1.el8.ppc64le 60/175 Installing : file-libs-5.33-25.el8.ppc64le 61/175 Installing : file-5.33-25.el8.ppc64le 62/175 Installing : libgcrypt-1.8.5-7.el8_6.ppc64le 63/175 Running scriptlet: libgcrypt-1.8.5-7.el8_6.ppc64le 63/175 Installing : unzip-6.0-46.el8.ppc64le 64/175 Installing : findutils-1:4.6.0-21.el8.ppc64le 65/175 Running scriptlet: findutils-1:4.6.0-21.el8.ppc64le 65/175 Installing : elfutils-default-yama-scope-0.189-3.el8.noarch 66/175 Running scriptlet: elfutils-default-yama-scope-0.189-3.el8.noarch 66/175 Installing : elfutils-libs-0.189-3.el8.ppc64le 67/175 Running scriptlet: glibc-headers-2.28-236.el8_9.12.ppc64le 68/175 Installing : glibc-headers-2.28-236.el8_9.12.ppc64le 68/175 Installing : lz4-libs-1.8.3-3.el8_4.ppc64le 69/175 Installing : pcre-8.42-6.el8.ppc64le 70/175 Installing : grep-3.1-6.el8.ppc64le 71/175 Running scriptlet: grep-3.1-6.el8.ppc64le 71/175 Installing : keyutils-libs-1.5.10-9.el8.ppc64le 72/175 Installing : libcap-ng-0.7.11-1.el8.ppc64le 73/175 Installing : audit-libs-3.0.7-5.el8.ppc64le 74/175 Installing : 
gdbm-libs-1:1.18-2.el8.ppc64le 75/175 Installing : libtasn1-4.13-4.el8_7.ppc64le 76/175 Running scriptlet: libtasn1-4.13-4.el8_7.ppc64le 76/175 Installing : p11-kit-trust-0.23.22-1.el8.ppc64le 77/175 Running scriptlet: p11-kit-trust-0.23.22-1.el8.ppc64le 77/175 Installing : expat-2.2.5-11.el8_9.1.ppc64le 78/175 Installing : gdbm-1:1.18-2.el8.ppc64le 79/175 Installing : libsemanage-2.9-9.el8_6.ppc64le 80/175 Installing : xz-5.2.4-4.el8_6.ppc64le 81/175 Installing : elfutils-0.189-3.el8.ppc64le 82/175 Installing : zip-3.0-23.el8.ppc64le 83/175 Installing : cpp-8.5.0-20.el8.ppc64le 84/175 Running scriptlet: cpp-8.5.0-20.el8.ppc64le 84/175 Installing : libassuan-2.5.1-3.el8.ppc64le 85/175 Installing : libksba-1.3.5-9.el8_7.ppc64le 86/175 Installing : tar-2:1.30-9.el8.ppc64le 87/175 Running scriptlet: tar-2:1.30-9.el8.ppc64le 87/175 Installing : patch-2.7.6-11.el8.ppc64le 88/175 Installing : dwz-0.12-10.el8.ppc64le 89/175 Installing : libasan-8.5.0-20.el8.ppc64le 90/175 Running scriptlet: libasan-8.5.0-20.el8.ppc64le 90/175 Installing : libubsan-8.5.0-20.el8.ppc64le 91/175 Running scriptlet: libubsan-8.5.0-20.el8.ppc64le 91/175 Installing : libstdc++-devel-8.5.0-20.el8.ppc64le 92/175 Installing : nettle-3.4.1-7.el8.ppc64le 93/175 Running scriptlet: nettle-3.4.1-7.el8.ppc64le 93/175 Installing : gnutls-3.6.16-8.el8_9.3.ppc64le 94/175 Installing : isl-0.16.1-6.el8.ppc64le 95/175 Running scriptlet: isl-0.16.1-6.el8.ppc64le 95/175 Installing : libxml2-2.9.7-18.el8_9.ppc64le 96/175 Installing : bzip2-1.0.6-26.el8.ppc64le 97/175 Installing : diffutils-3.6-6.el8.ppc64le 98/175 Running scriptlet: diffutils-3.6-6.el8.ppc64le 98/175 Installing : coreutils-common-8.30-15.el8.ppc64le 99/175 Running scriptlet: coreutils-common-8.30-15.el8.ppc64le 99/175 Installing : libatomic-8.5.0-20.el8.ppc64le 100/175 Running scriptlet: libatomic-8.5.0-20.el8.ppc64le 100/175 Installing : libgomp-8.5.0-20.el8.ppc64le 101/175 Running scriptlet: libgomp-8.5.0-20.el8.ppc64le 101/175 Installing : 
zstd-1.4.4-1.el8.ppc64le 102/175 Installing : libpkgconf-1.4.2-1.el8.ppc64le 103/175 Installing : pkgconf-1.4.2-1.el8.ppc64le 104/175 Installing : librtas-2.0.2-1.el8.ppc64le 105/175 Running scriptlet: librtas-2.0.2-1.el8.ppc64le 105/175 Installing : libsigsegv-2.11-5.el8.ppc64le 106/175 Installing : gawk-4.2.1-4.el8.ppc64le 107/175 Installing : libtool-ltdl-2.4.6-25.el8.ppc64le 108/175 Running scriptlet: libtool-ltdl-2.4.6-25.el8.ppc64le 108/175 Installing : npth-1.5-4.el8.ppc64le 109/175 Installing : brotli-1.0.6-3.el8.ppc64le 110/175 Installing : cpio-2.12-11.el8.ppc64le 111/175 Installing : libverto-0.3.2-2.el8.ppc64le 112/175 Installing : libnghttp2-1.33.0-5.el8_9.ppc64le 113/175 Installing : ncurses-6.1-10.20180224.el8.ppc64le 114/175 Installing : openssl-libs-1:1.1.1k-12.el8_9.ppc64le 115/175 Running scriptlet: openssl-libs-1:1.1.1k-12.el8_9.ppc64le 115/175 Installing : coreutils-8.30-15.el8.ppc64le 116/175 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 117/175 Installing : ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 117/175 Running scriptlet: ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 117/175 Installing : libdb-5.3.28-42.el8_4.ppc64le 118/175 Running scriptlet: libdb-5.3.28-42.el8_4.ppc64le 118/175 Installing : krb5-libs-1.18.2-26.el8_9.ppc64le 119/175 Installing : libtirpc-1.1.4-8.el8.ppc64le 120/175 Running scriptlet: libtirpc-1.1.4-8.el8.ppc64le 120/175 Installing : libblkid-2.32.1-44.el8_9.1.ppc64le 121/175 Running scriptlet: libblkid-2.32.1-44.el8_9.1.ppc64le 121/175 Installing : libmount-2.32.1-44.el8_9.1.ppc64le 122/175 Running scriptlet: libmount-2.32.1-44.el8_9.1.ppc64le 122/175 Installing : systemd-libs-239-78.el8.ppc64le 123/175 Running scriptlet: systemd-libs-239-78.el8.ppc64le 123/175 Installing : libnsl2-1.2.0-2.20180605git4a062cf.el8.ppc64le 124/175 Running scriptlet: libnsl2-1.2.0-2.20180605git4a062cf.el8.ppc64le 124/175 Installing : platform-python-setuptools-39.2.0-7.el8.noarch 125/175 Installing : 
platform-python-3.6.8-56.el8_9.3.ppc64le 126/175 Running scriptlet: platform-python-3.6.8-56.el8_9.3.ppc64le 126/175 Installing : python3-libs-3.6.8-56.el8_9.3.ppc64le 127/175 Installing : gzip-1.9-13.el8_5.ppc64le 128/175 Running scriptlet: gzip-1.9-13.el8_5.ppc64le 128/175 Installing : cracklib-2.9.6-15.el8.ppc64le 129/175 Installing : cracklib-dicts-2.9.6-15.el8.ppc64le 130/175 Installing : binutils-2.30-123.el8.ppc64le 131/175 Running scriptlet: binutils-2.30-123.el8.ppc64le 131/175 Installing : shadow-utils-2:4.6-19.el8.ppc64le 132/175 Running scriptlet: libutempter-1.1.6-14.el8.ppc64le 133/175 Installing : libutempter-1.1.6-14.el8.ppc64le 133/175 Running scriptlet: tpm2-tss-2.3.2-5.el8.ppc64le 134/175 Installing : tpm2-tss-2.3.2-5.el8.ppc64le 134/175 Running scriptlet: tpm2-tss-2.3.2-5.el8.ppc64le 134/175 Installing : ima-evm-utils-1.3.2-12.el8.ppc64le 135/175 Installing : libpwquality-1.4.4-6.el8.ppc64le 136/175 Installing : pam-1.3.1-27.el8.ppc64le 137/175 Running scriptlet: pam-1.3.1-27.el8.ppc64le 137/175 Installing : libusbx-1.0.23-4.el8.ppc64le 138/175 Installing : glib2-2.56.4-161.el8.ppc64le 139/175 Installing : libbabeltrace-1.5.4-4.el8.ppc64le 140/175 Running scriptlet: libbabeltrace-1.5.4-4.el8.ppc64le 140/175 Installing : libfdisk-2.32.1-44.el8_9.1.ppc64le 141/175 Running scriptlet: libfdisk-2.32.1-44.el8_9.1.ppc64le 141/175 Installing : cyrus-sasl-lib-2.1.27-6.el8_5.ppc64le 142/175 Running scriptlet: cyrus-sasl-lib-2.1.27-6.el8_5.ppc64le 142/175 Installing : openldap-2.4.46-18.el8.ppc64le 143/175 Installing : gnupg2-2.2.20-3.el8_6.ppc64le 144/175 Installing : libssh-0.9.6-13.el8_9.ppc64le 145/175 Installing : libdb-utils-5.3.28-42.el8_4.ppc64le 146/175 Installing : libarchive-3.3.3-5.el8.ppc64le 147/175 Installing : libsmartcols-2.32.1-44.el8_9.1.ppc64le 148/175 Running scriptlet: libsmartcols-2.32.1-44.el8_9.1.ppc64le 148/175 Installing : libatomic_ops-7.6.2-3.el8.ppc64le 149/175 Installing : gc-7.6.4-3.el8.ppc64le 150/175 Installing : 
guile-5:2.0.14-7.el8.ppc64le 151/175 Running scriptlet: guile-5:2.0.14-7.el8.ppc64le 151/175 Installing : publicsuffix-list-dafsa-20180723-1.el8.noarch 152/175 Installing : libpsl-0.20.2-6.el8.ppc64le 153/175 Installing : libcurl-7.61.1-33.el8_9.5.ppc64le 154/175 Installing : curl-7.61.1-33.el8_9.5.ppc64le 155/175 Installing : rpm-4.14.3-28.el8_9.ppc64le 156/175 Installing : rpm-libs-4.14.3-28.el8_9.ppc64le 157/175 Running scriptlet: rpm-libs-4.14.3-28.el8_9.ppc64le 157/175 Installing : rpm-build-libs-4.14.3-28.el8_9.ppc64le 158/175 Running scriptlet: rpm-build-libs-4.14.3-28.el8_9.ppc64le 158/175 Installing : gdb-headless-8.2-20.el8.ppc64le 159/175 Installing : efi-srpm-macros-3-3.el8.noarch 160/175 Installing : lua-srpm-macros-1-13.el8.noarch 161/175 Installing : pkgconf-m4-1.4.2-1.el8.noarch 162/175 Installing : pkgconf-pkg-config-1.4.2-1.el8.ppc64le 163/175 Installing : glibc-devel-2.28-236.el8_9.12.ppc64le 164/175 Running scriptlet: glibc-devel-2.28-236.el8_9.12.ppc64le 164/175 Installing : libxcrypt-devel-4.1.1-6.el8.ppc64le 165/175 Installing : gcc-8.5.0-20.el8.ppc64le 166/175 Running scriptlet: gcc-8.5.0-20.el8.ppc64le 166/175 Installing : annobin-11.13-2.el8.ppc64le 167/175 Installing : gcc-plugin-annobin-8.5.0-20.el8.ppc64le 168/175 Installing : redhat-rpm-config-131-1.el8.noarch 169/175 Running scriptlet: redhat-rpm-config-131-1.el8.noarch 169/175 Installing : rpm-build-4.14.3-28.el8_9.ppc64le 170/175 Installing : gcc-c++-8.5.0-20.el8.ppc64le 171/175 Installing : epel-rpm-macros-8-41.noarch 172/175 Installing : util-linux-2.32.1-44.el8_9.1.ppc64le 173/175 Running scriptlet: util-linux-2.32.1-44.el8_9.1.ppc64le 173/175 Installing : which-2.21-20.el8.ppc64le 174/175 Installing : make-1:4.2.1-11.el8.ppc64le 175/175 Running scriptlet: make-1:4.2.1-11.el8.ppc64le 175/175 Running scriptlet: filesystem-3.8-6.el8.ppc64le 175/175 Running scriptlet: glibc-all-langpacks-2.28-236.el8_9.12.ppc64le 175/175 Running scriptlet: 
ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 175/175 Running scriptlet: guile-5:2.0.14-7.el8.ppc64le 175/175 Running scriptlet: glibc-common-2.28-236.el8_9.12.ppc64le 175/175 Running scriptlet: info-6.5-7.el8.ppc64le 175/175 Running scriptlet: glib2-2.56.4-161.el8.ppc64le 175/175 Verifying : bzip2-1.0.6-26.el8.ppc64le 1/175 Verifying : bzip2-libs-1.0.6-26.el8.ppc64le 2/175 Verifying : cracklib-2.9.6-15.el8.ppc64le 3/175 Verifying : cracklib-dicts-2.9.6-15.el8.ppc64le 4/175 Verifying : grep-3.1-6.el8.ppc64le 5/175 Verifying : libacl-2.2.53-1.el8.ppc64le 6/175 Verifying : libassuan-2.5.1-3.el8.ppc64le 7/175 Verifying : libattr-2.4.48-3.el8.ppc64le 8/175 Verifying : libgpg-error-1.31-1.el8.ppc64le 9/175 Verifying : libnsl2-1.2.0-2.20180605git4a062cf.el8.ppc64le 10/175 Verifying : libpkgconf-1.4.2-1.el8.ppc64le 11/175 Verifying : librtas-2.0.2-1.el8.ppc64le 12/175 Verifying : libsigsegv-2.11-5.el8.ppc64le 13/175 Verifying : libtool-ltdl-2.4.6-25.el8.ppc64le 14/175 Verifying : libunistring-0.9.9-3.el8.ppc64le 15/175 Verifying : libutempter-1.1.6-14.el8.ppc64le 16/175 Verifying : mpfr-3.1.6-1.el8.ppc64le 17/175 Verifying : npth-1.5-4.el8.ppc64le 18/175 Verifying : pkgconf-1.4.2-1.el8.ppc64le 19/175 Verifying : pkgconf-pkg-config-1.4.2-1.el8.ppc64le 20/175 Verifying : readline-7.0-10.el8.ppc64le 21/175 Verifying : zip-3.0-23.el8.ppc64le 22/175 Verifying : basesystem-11-5.el8.noarch 23/175 Verifying : pkgconf-m4-1.4.2-1.el8.noarch 24/175 Verifying : publicsuffix-list-dafsa-20180723-1.el8.noarch 25/175 Verifying : gmp-1:6.1.2-10.el8.ppc64le 26/175 Verifying : libidn2-2.2.0-1.el8.ppc64le 27/175 Verifying : diffutils-3.6-6.el8.ppc64le 28/175 Verifying : patch-2.7.6-11.el8.ppc64le 29/175 Verifying : libpsl-0.20.2-6.el8.ppc64le 30/175 Verifying : libusbx-1.0.23-4.el8.ppc64le 31/175 Verifying : libzstd-1.4.4-1.el8.ppc64le 32/175 Verifying : brotli-1.0.6-3.el8.ppc64le 33/175 Verifying : ima-evm-utils-1.3.2-12.el8.ppc64le 34/175 Verifying : p11-kit-0.23.22-1.el8.ppc64le 35/175 
Verifying : p11-kit-trust-0.23.22-1.el8.ppc64le 36/175 Verifying : popt-1.18-1.el8.ppc64le 37/175 Verifying : libdb-5.3.28-42.el8_4.ppc64le 38/175 Verifying : libdb-utils-5.3.28-42.el8_4.ppc64le 39/175 Verifying : libxcrypt-devel-4.1.1-6.el8.ppc64le 40/175 Verifying : lua-libs-5.3.4-12.el8.ppc64le 41/175 Verifying : lz4-libs-1.8.3-3.el8_4.ppc64le 42/175 Verifying : openldap-2.4.46-18.el8.ppc64le 43/175 Verifying : pcre-8.42-6.el8.ppc64le 44/175 Verifying : cyrus-sasl-lib-2.1.27-6.el8_5.ppc64le 45/175 Verifying : filesystem-3.8-6.el8.ppc64le 46/175 Verifying : keyutils-libs-1.5.10-9.el8.ppc64le 47/175 Verifying : libcap-ng-0.7.11-1.el8.ppc64le 48/175 Verifying : libsepol-2.9-3.el8.ppc64le 49/175 Verifying : libxcrypt-4.1.1-6.el8.ppc64le 50/175 Verifying : nettle-3.4.1-7.el8.ppc64le 51/175 Verifying : cpio-2.12-11.el8.ppc64le 52/175 Verifying : gawk-4.2.1-4.el8.ppc64le 53/175 Verifying : gzip-1.9-13.el8_5.ppc64le 54/175 Verifying : info-6.5-7.el8.ppc64le 55/175 Verifying : make-1:4.2.1-11.el8.ppc64le 56/175 Verifying : sed-4.5-5.el8.ppc64le 57/175 Verifying : unzip-6.0-46.el8.ppc64le 58/175 Verifying : xz-5.2.4-4.el8_6.ppc64le 59/175 Verifying : xz-libs-5.2.4-4.el8_6.ppc64le 60/175 Verifying : bash-4.4.20-4.el8_6.ppc64le 61/175 Verifying : gdbm-1:1.18-2.el8.ppc64le 62/175 Verifying : gdbm-libs-1:1.18-2.el8.ppc64le 63/175 Verifying : gnupg2-2.2.20-3.el8_6.ppc64le 64/175 Verifying : libbabeltrace-1.5.4-4.el8.ppc64le 65/175 Verifying : libcom_err-1.45.6-5.el8.ppc64le 66/175 Verifying : libgcrypt-1.8.5-7.el8_6.ppc64le 67/175 Verifying : libsemanage-2.9-9.el8_6.ppc64le 68/175 Verifying : libtirpc-1.1.4-8.el8.ppc64le 69/175 Verifying : libverto-0.3.2-2.el8.ppc64le 70/175 Verifying : pcre2-10.32-3.el8_6.ppc64le 71/175 Verifying : coreutils-common-8.30-15.el8.ppc64le 72/175 Verifying : glib2-2.56.4-161.el8.ppc64le 73/175 Verifying : libarchive-3.3.3-5.el8.ppc64le 74/175 Verifying : libffi-3.1-24.el8.ppc64le 75/175 Verifying : libksba-1.3.5-9.el8_7.ppc64le 76/175 Verifying : 
libselinux-2.9-8.el8.ppc64le 77/175 Verifying : libtasn1-4.13-4.el8_7.ppc64le 78/175 Verifying : platform-python-setuptools-39.2.0-7.el8.noarch 79/175 Verifying : setup-2.12.2-9.el8.noarch 80/175 Verifying : tar-2:1.30-9.el8.ppc64le 81/175 Verifying : ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.no 82/175 Verifying : coreutils-8.30-15.el8.ppc64le 83/175 Verifying : crypto-policies-20230731-1.git3177e06.el8.noarch 84/175 Verifying : elfutils-0.189-3.el8.ppc64le 85/175 Verifying : file-5.33-25.el8.ppc64le 86/175 Verifying : libatomic-8.5.0-20.el8.ppc64le 87/175 Verifying : libgcc-8.5.0-20.el8.ppc64le 88/175 Verifying : libpwquality-1.4.4-6.el8.ppc64le 89/175 Verifying : pam-1.3.1-27.el8.ppc64le 90/175 Verifying : python3-setuptools-wheel-39.2.0-7.el8.noarch 91/175 Verifying : which-2.21-20.el8.ppc64le 92/175 Verifying : zlib-1.2.11-25.el8.ppc64le 93/175 Verifying : audit-libs-3.0.7-5.el8.ppc64le 94/175 Verifying : binutils-2.30-123.el8.ppc64le 95/175 Verifying : chkconfig-1.19.2-1.el8.ppc64le 96/175 Verifying : elfutils-default-yama-scope-0.189-3.el8.noarch 97/175 Verifying : elfutils-libelf-0.189-3.el8.ppc64le 98/175 Verifying : elfutils-libs-0.189-3.el8.ppc64le 99/175 Verifying : file-libs-5.33-25.el8.ppc64le 100/175 Verifying : findutils-1:4.6.0-21.el8.ppc64le 101/175 Verifying : krb5-libs-1.18.2-26.el8_9.ppc64le 102/175 Verifying : libasan-8.5.0-20.el8.ppc64le 103/175 Verifying : libcap-2.48-6.el8_9.ppc64le 104/175 Verifying : libgomp-8.5.0-20.el8.ppc64le 105/175 Verifying : libnghttp2-1.33.0-5.el8_9.ppc64le 106/175 Verifying : libstdc++-8.5.0-20.el8.ppc64le 107/175 Verifying : libubsan-8.5.0-20.el8.ppc64le 108/175 Verifying : libxml2-2.9.7-18.el8_9.ppc64le 109/175 Verifying : ncurses-6.1-10.20180224.el8.ppc64le 110/175 Verifying : ncurses-base-6.1-10.20180224.el8.noarch 111/175 Verifying : ncurses-libs-6.1-10.20180224.el8.ppc64le 112/175 Verifying : openssl-libs-1:1.1.1k-12.el8_9.ppc64le 113/175 Verifying : platform-python-3.6.8-56.el8_9.3.ppc64le 114/175 
Verifying : python3-libs-3.6.8-56.el8_9.3.ppc64le 115/175 Verifying : redhat-release-8.9-0.1.el8.ppc64le 116/175 Verifying : shadow-utils-2:4.6-19.el8.ppc64le 117/175 Verifying : sqlite-libs-3.26.0-19.el8_9.ppc64le 118/175 Verifying : systemd-libs-239-78.el8.ppc64le 119/175 Verifying : tpm2-tss-2.3.2-5.el8.ppc64le 120/175 Verifying : libssh-0.9.6-13.el8_9.ppc64le 121/175 Verifying : libssh-config-0.9.6-13.el8_9.noarch 122/175 Verifying : rpm-4.14.3-28.el8_9.ppc64le 123/175 Verifying : rpm-build-libs-4.14.3-28.el8_9.ppc64le 124/175 Verifying : rpm-libs-4.14.3-28.el8_9.ppc64le 125/175 Verifying : tzdata-2024a-1.el8.noarch 126/175 Verifying : glibc-2.28-236.el8_9.12.ppc64le 127/175 Verifying : glibc-all-langpacks-2.28-236.el8_9.12.ppc64le 128/175 Verifying : glibc-common-2.28-236.el8_9.12.ppc64le 129/175 Verifying : glibc-devel-2.28-236.el8_9.12.ppc64le 130/175 Verifying : glibc-gconv-extra-2.28-236.el8_9.12.ppc64le 131/175 Verifying : glibc-headers-2.28-236.el8_9.12.ppc64le 132/175 Verifying : curl-7.61.1-33.el8_9.5.ppc64le 133/175 Verifying : kernel-headers-4.18.0-513.24.1.el8_9.ppc64le 134/175 Verifying : libblkid-2.32.1-44.el8_9.1.ppc64le 135/175 Verifying : libcurl-7.61.1-33.el8_9.5.ppc64le 136/175 Verifying : libfdisk-2.32.1-44.el8_9.1.ppc64le 137/175 Verifying : libmount-2.32.1-44.el8_9.1.ppc64le 138/175 Verifying : libsmartcols-2.32.1-44.el8_9.1.ppc64le 139/175 Verifying : libuuid-2.32.1-44.el8_9.1.ppc64le 140/175 Verifying : python3-pip-wheel-9.0.3-23.el8_9.1.noarch 141/175 Verifying : util-linux-2.32.1-44.el8_9.1.ppc64le 142/175 Verifying : expat-2.2.5-11.el8_9.1.ppc64le 143/175 Verifying : gnutls-3.6.16-8.el8_9.3.ppc64le 144/175 Verifying : gc-7.6.4-3.el8.ppc64le 145/175 Verifying : libatomic_ops-7.6.2-3.el8.ppc64le 146/175 Verifying : isl-0.16.1-6.el8.ppc64le 147/175 Verifying : guile-5:2.0.14-7.el8.ppc64le 148/175 Verifying : ghc-srpm-macros-1.4.2-7.el8.noarch 149/175 Verifying : ocaml-srpm-macros-5-4.el8.noarch 150/175 Verifying : 
openblas-srpm-macros-2-2.el8.noarch 151/175 Verifying : perl-srpm-macros-1-25.el8.noarch 152/175 Verifying : rust-srpm-macros-5-2.el8.noarch 153/175 Verifying : zstd-1.4.4-1.el8.ppc64le 154/175 Verifying : efi-srpm-macros-3-3.el8.noarch 155/175 Verifying : go-srpm-macros-2-17.el8.noarch 156/175 Verifying : libmpc-1.1.0-9.1.el8.ppc64le 157/175 Verifying : dwz-0.12-10.el8.ppc64le 158/175 Verifying : qt5-srpm-macros-5.15.3-1.el8.noarch 159/175 Verifying : python-rpm-macros-3-45.el8.noarch 160/175 Verifying : python3-rpm-macros-3-45.el8.noarch 161/175 Verifying : redhat-rpm-config-131-1.el8.noarch 162/175 Verifying : python-srpm-macros-3-45.el8.noarch 163/175 Verifying : annobin-11.13-2.el8.ppc64le 164/175 Verifying : gcc-8.5.0-20.el8.ppc64le 165/175 Verifying : gcc-c++-8.5.0-20.el8.ppc64le 166/175 Verifying : gcc-plugin-annobin-8.5.0-20.el8.ppc64le 167/175 Verifying : gdb-headless-8.2-20.el8.ppc64le 168/175 Verifying : libstdc++-devel-8.5.0-20.el8.ppc64le 169/175 Verifying : cpp-8.5.0-20.el8.ppc64le 170/175 Verifying : rpm-build-4.14.3-28.el8_9.ppc64le 171/175 Verifying : ansible-srpm-macros-1-12.el8.noarch 172/175 Verifying : epel-rpm-macros-8-41.noarch 173/175 Verifying : fpc-srpm-macros-1.3-1.el8.noarch 174/175 Verifying : lua-srpm-macros-1-13.el8.noarch 175/175 Installed products updated. 
Installed: annobin-11.13-2.el8.ppc64le ansible-srpm-macros-1-12.el8.noarch audit-libs-3.0.7-5.el8.ppc64le basesystem-11-5.el8.noarch bash-4.4.20-4.el8_6.ppc64le binutils-2.30-123.el8.ppc64le brotli-1.0.6-3.el8.ppc64le bzip2-1.0.6-26.el8.ppc64le bzip2-libs-1.0.6-26.el8.ppc64le ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.noarch chkconfig-1.19.2-1.el8.ppc64le coreutils-8.30-15.el8.ppc64le coreutils-common-8.30-15.el8.ppc64le cpio-2.12-11.el8.ppc64le cpp-8.5.0-20.el8.ppc64le cracklib-2.9.6-15.el8.ppc64le cracklib-dicts-2.9.6-15.el8.ppc64le crypto-policies-20230731-1.git3177e06.el8.noarch curl-7.61.1-33.el8_9.5.ppc64le cyrus-sasl-lib-2.1.27-6.el8_5.ppc64le diffutils-3.6-6.el8.ppc64le dwz-0.12-10.el8.ppc64le efi-srpm-macros-3-3.el8.noarch elfutils-0.189-3.el8.ppc64le elfutils-default-yama-scope-0.189-3.el8.noarch elfutils-libelf-0.189-3.el8.ppc64le elfutils-libs-0.189-3.el8.ppc64le epel-rpm-macros-8-41.noarch expat-2.2.5-11.el8_9.1.ppc64le file-5.33-25.el8.ppc64le file-libs-5.33-25.el8.ppc64le filesystem-3.8-6.el8.ppc64le findutils-1:4.6.0-21.el8.ppc64le fpc-srpm-macros-1.3-1.el8.noarch gawk-4.2.1-4.el8.ppc64le gc-7.6.4-3.el8.ppc64le gcc-8.5.0-20.el8.ppc64le gcc-c++-8.5.0-20.el8.ppc64le gcc-plugin-annobin-8.5.0-20.el8.ppc64le gdb-headless-8.2-20.el8.ppc64le gdbm-1:1.18-2.el8.ppc64le gdbm-libs-1:1.18-2.el8.ppc64le ghc-srpm-macros-1.4.2-7.el8.noarch glib2-2.56.4-161.el8.ppc64le glibc-2.28-236.el8_9.12.ppc64le glibc-all-langpacks-2.28-236.el8_9.12.ppc64le glibc-common-2.28-236.el8_9.12.ppc64le glibc-devel-2.28-236.el8_9.12.ppc64le glibc-gconv-extra-2.28-236.el8_9.12.ppc64le glibc-headers-2.28-236.el8_9.12.ppc64le gmp-1:6.1.2-10.el8.ppc64le gnupg2-2.2.20-3.el8_6.ppc64le gnutls-3.6.16-8.el8_9.3.ppc64le go-srpm-macros-2-17.el8.noarch grep-3.1-6.el8.ppc64le guile-5:2.0.14-7.el8.ppc64le gzip-1.9-13.el8_5.ppc64le ima-evm-utils-1.3.2-12.el8.ppc64le info-6.5-7.el8.ppc64le isl-0.16.1-6.el8.ppc64le kernel-headers-4.18.0-513.24.1.el8_9.ppc64le keyutils-libs-1.5.10-9.el8.ppc64le 
krb5-libs-1.18.2-26.el8_9.ppc64le libacl-2.2.53-1.el8.ppc64le libarchive-3.3.3-5.el8.ppc64le libasan-8.5.0-20.el8.ppc64le libassuan-2.5.1-3.el8.ppc64le libatomic-8.5.0-20.el8.ppc64le libatomic_ops-7.6.2-3.el8.ppc64le libattr-2.4.48-3.el8.ppc64le libbabeltrace-1.5.4-4.el8.ppc64le libblkid-2.32.1-44.el8_9.1.ppc64le libcap-2.48-6.el8_9.ppc64le libcap-ng-0.7.11-1.el8.ppc64le libcom_err-1.45.6-5.el8.ppc64le libcurl-7.61.1-33.el8_9.5.ppc64le libdb-5.3.28-42.el8_4.ppc64le libdb-utils-5.3.28-42.el8_4.ppc64le libfdisk-2.32.1-44.el8_9.1.ppc64le libffi-3.1-24.el8.ppc64le libgcc-8.5.0-20.el8.ppc64le libgcrypt-1.8.5-7.el8_6.ppc64le libgomp-8.5.0-20.el8.ppc64le libgpg-error-1.31-1.el8.ppc64le libidn2-2.2.0-1.el8.ppc64le libksba-1.3.5-9.el8_7.ppc64le libmount-2.32.1-44.el8_9.1.ppc64le libmpc-1.1.0-9.1.el8.ppc64le libnghttp2-1.33.0-5.el8_9.ppc64le libnsl2-1.2.0-2.20180605git4a062cf.el8.ppc64le libpkgconf-1.4.2-1.el8.ppc64le libpsl-0.20.2-6.el8.ppc64le libpwquality-1.4.4-6.el8.ppc64le librtas-2.0.2-1.el8.ppc64le libselinux-2.9-8.el8.ppc64le libsemanage-2.9-9.el8_6.ppc64le libsepol-2.9-3.el8.ppc64le libsigsegv-2.11-5.el8.ppc64le libsmartcols-2.32.1-44.el8_9.1.ppc64le libssh-0.9.6-13.el8_9.ppc64le libssh-config-0.9.6-13.el8_9.noarch libstdc++-8.5.0-20.el8.ppc64le libstdc++-devel-8.5.0-20.el8.ppc64le libtasn1-4.13-4.el8_7.ppc64le libtirpc-1.1.4-8.el8.ppc64le libtool-ltdl-2.4.6-25.el8.ppc64le libubsan-8.5.0-20.el8.ppc64le libunistring-0.9.9-3.el8.ppc64le libusbx-1.0.23-4.el8.ppc64le libutempter-1.1.6-14.el8.ppc64le libuuid-2.32.1-44.el8_9.1.ppc64le libverto-0.3.2-2.el8.ppc64le libxcrypt-4.1.1-6.el8.ppc64le libxcrypt-devel-4.1.1-6.el8.ppc64le libxml2-2.9.7-18.el8_9.ppc64le libzstd-1.4.4-1.el8.ppc64le lua-libs-5.3.4-12.el8.ppc64le lua-srpm-macros-1-13.el8.noarch lz4-libs-1.8.3-3.el8_4.ppc64le make-1:4.2.1-11.el8.ppc64le mpfr-3.1.6-1.el8.ppc64le ncurses-6.1-10.20180224.el8.ppc64le ncurses-base-6.1-10.20180224.el8.noarch ncurses-libs-6.1-10.20180224.el8.ppc64le nettle-3.4.1-7.el8.ppc64le 
npth-1.5-4.el8.ppc64le ocaml-srpm-macros-5-4.el8.noarch openblas-srpm-macros-2-2.el8.noarch openldap-2.4.46-18.el8.ppc64le openssl-libs-1:1.1.1k-12.el8_9.ppc64le p11-kit-0.23.22-1.el8.ppc64le p11-kit-trust-0.23.22-1.el8.ppc64le pam-1.3.1-27.el8.ppc64le patch-2.7.6-11.el8.ppc64le pcre-8.42-6.el8.ppc64le pcre2-10.32-3.el8_6.ppc64le perl-srpm-macros-1-25.el8.noarch pkgconf-1.4.2-1.el8.ppc64le pkgconf-m4-1.4.2-1.el8.noarch pkgconf-pkg-config-1.4.2-1.el8.ppc64le platform-python-3.6.8-56.el8_9.3.ppc64le platform-python-setuptools-39.2.0-7.el8.noarch popt-1.18-1.el8.ppc64le publicsuffix-list-dafsa-20180723-1.el8.noarch python-rpm-macros-3-45.el8.noarch python-srpm-macros-3-45.el8.noarch python3-libs-3.6.8-56.el8_9.3.ppc64le python3-pip-wheel-9.0.3-23.el8_9.1.noarch python3-rpm-macros-3-45.el8.noarch python3-setuptools-wheel-39.2.0-7.el8.noarch qt5-srpm-macros-5.15.3-1.el8.noarch readline-7.0-10.el8.ppc64le redhat-release-8.9-0.1.el8.ppc64le redhat-rpm-config-131-1.el8.noarch rpm-4.14.3-28.el8_9.ppc64le rpm-build-4.14.3-28.el8_9.ppc64le rpm-build-libs-4.14.3-28.el8_9.ppc64le rpm-libs-4.14.3-28.el8_9.ppc64le rust-srpm-macros-5-2.el8.noarch sed-4.5-5.el8.ppc64le setup-2.12.2-9.el8.noarch shadow-utils-2:4.6-19.el8.ppc64le sqlite-libs-3.26.0-19.el8_9.ppc64le systemd-libs-239-78.el8.ppc64le tar-2:1.30-9.el8.ppc64le tpm2-tss-2.3.2-5.el8.ppc64le tzdata-2024a-1.el8.noarch unzip-6.0-46.el8.ppc64le util-linux-2.32.1-44.el8_9.1.ppc64le which-2.21-20.el8.ppc64le xz-5.2.4-4.el8_6.ppc64le xz-libs-5.2.4-4.el8_6.ppc64le zip-3.0-23.el8.ppc64le zlib-1.2.11-25.el8.ppc64le zstd-1.4.4-1.el8.ppc64le Complete! 
Finish: installing minimal buildroot with dnf Start: creating root cache Finish: creating root cache Finish: chroot init INFO: Installed packages: INFO: annobin-11.13-2.el8.ppc64le ansible-srpm-macros-1-12.el8.noarch audit-libs-3.0.7-5.el8.ppc64le basesystem-11-5.el8.noarch bash-4.4.20-4.el8_6.ppc64le binutils-2.30-123.el8.ppc64le brotli-1.0.6-3.el8.ppc64le bzip2-1.0.6-26.el8.ppc64le bzip2-libs-1.0.6-26.el8.ppc64le ca-certificates-2023.2.60_v7.0.306-80.0.el8_8.noarch chkconfig-1.19.2-1.el8.ppc64le coreutils-8.30-15.el8.ppc64le coreutils-common-8.30-15.el8.ppc64le cpio-2.12-11.el8.ppc64le cpp-8.5.0-20.el8.ppc64le cracklib-2.9.6-15.el8.ppc64le cracklib-dicts-2.9.6-15.el8.ppc64le crypto-policies-20230731-1.git3177e06.el8.noarch curl-7.61.1-33.el8_9.5.ppc64le cyrus-sasl-lib-2.1.27-6.el8_5.ppc64le diffutils-3.6-6.el8.ppc64le dwz-0.12-10.el8.ppc64le efi-srpm-macros-3-3.el8.noarch elfutils-0.189-3.el8.ppc64le elfutils-default-yama-scope-0.189-3.el8.noarch elfutils-libelf-0.189-3.el8.ppc64le elfutils-libs-0.189-3.el8.ppc64le epel-rpm-macros-8-41.noarch expat-2.2.5-11.el8_9.1.ppc64le file-5.33-25.el8.ppc64le file-libs-5.33-25.el8.ppc64le filesystem-3.8-6.el8.ppc64le findutils-4.6.0-21.el8.ppc64le fpc-srpm-macros-1.3-1.el8.noarch gawk-4.2.1-4.el8.ppc64le gc-7.6.4-3.el8.ppc64le gcc-8.5.0-20.el8.ppc64le gcc-c++-8.5.0-20.el8.ppc64le gcc-plugin-annobin-8.5.0-20.el8.ppc64le gdb-headless-8.2-20.el8.ppc64le gdbm-1.18-2.el8.ppc64le gdbm-libs-1.18-2.el8.ppc64le ghc-srpm-macros-1.4.2-7.el8.noarch glib2-2.56.4-161.el8.ppc64le glibc-2.28-236.el8_9.12.ppc64le glibc-all-langpacks-2.28-236.el8_9.12.ppc64le glibc-common-2.28-236.el8_9.12.ppc64le glibc-devel-2.28-236.el8_9.12.ppc64le glibc-gconv-extra-2.28-236.el8_9.12.ppc64le glibc-headers-2.28-236.el8_9.12.ppc64le gmp-6.1.2-10.el8.ppc64le gnupg2-2.2.20-3.el8_6.ppc64le gnutls-3.6.16-8.el8_9.3.ppc64le go-srpm-macros-2-17.el8.noarch gpg-pubkey-2f86d6a1-5cf7cefb gpg-pubkey-2fa658e0-45700c69 gpg-pubkey-fd431d51-4ae0493b grep-3.1-6.el8.ppc64le 
guile-2.0.14-7.el8.ppc64le gzip-1.9-13.el8_5.ppc64le ima-evm-utils-1.3.2-12.el8.ppc64le info-6.5-7.el8.ppc64le isl-0.16.1-6.el8.ppc64le kernel-headers-4.18.0-513.24.1.el8_9.ppc64le keyutils-libs-1.5.10-9.el8.ppc64le krb5-libs-1.18.2-26.el8_9.ppc64le libacl-2.2.53-1.el8.ppc64le libarchive-3.3.3-5.el8.ppc64le libasan-8.5.0-20.el8.ppc64le libassuan-2.5.1-3.el8.ppc64le libatomic-8.5.0-20.el8.ppc64le libatomic_ops-7.6.2-3.el8.ppc64le libattr-2.4.48-3.el8.ppc64le libbabeltrace-1.5.4-4.el8.ppc64le libblkid-2.32.1-44.el8_9.1.ppc64le libcap-2.48-6.el8_9.ppc64le libcap-ng-0.7.11-1.el8.ppc64le libcom_err-1.45.6-5.el8.ppc64le libcurl-7.61.1-33.el8_9.5.ppc64le libdb-5.3.28-42.el8_4.ppc64le libdb-utils-5.3.28-42.el8_4.ppc64le libfdisk-2.32.1-44.el8_9.1.ppc64le libffi-3.1-24.el8.ppc64le libgcc-8.5.0-20.el8.ppc64le libgcrypt-1.8.5-7.el8_6.ppc64le libgomp-8.5.0-20.el8.ppc64le libgpg-error-1.31-1.el8.ppc64le libidn2-2.2.0-1.el8.ppc64le libksba-1.3.5-9.el8_7.ppc64le libmount-2.32.1-44.el8_9.1.ppc64le libmpc-1.1.0-9.1.el8.ppc64le libnghttp2-1.33.0-5.el8_9.ppc64le libnsl2-1.2.0-2.20180605git4a062cf.el8.ppc64le libpkgconf-1.4.2-1.el8.ppc64le libpsl-0.20.2-6.el8.ppc64le libpwquality-1.4.4-6.el8.ppc64le librtas-2.0.2-1.el8.ppc64le libselinux-2.9-8.el8.ppc64le libsemanage-2.9-9.el8_6.ppc64le libsepol-2.9-3.el8.ppc64le libsigsegv-2.11-5.el8.ppc64le libsmartcols-2.32.1-44.el8_9.1.ppc64le libssh-0.9.6-13.el8_9.ppc64le libssh-config-0.9.6-13.el8_9.noarch libstdc++-8.5.0-20.el8.ppc64le libstdc++-devel-8.5.0-20.el8.ppc64le libtasn1-4.13-4.el8_7.ppc64le libtirpc-1.1.4-8.el8.ppc64le libtool-ltdl-2.4.6-25.el8.ppc64le libubsan-8.5.0-20.el8.ppc64le libunistring-0.9.9-3.el8.ppc64le libusbx-1.0.23-4.el8.ppc64le libutempter-1.1.6-14.el8.ppc64le libuuid-2.32.1-44.el8_9.1.ppc64le libverto-0.3.2-2.el8.ppc64le libxcrypt-4.1.1-6.el8.ppc64le libxcrypt-devel-4.1.1-6.el8.ppc64le libxml2-2.9.7-18.el8_9.ppc64le libzstd-1.4.4-1.el8.ppc64le lua-libs-5.3.4-12.el8.ppc64le lua-srpm-macros-1-13.el8.noarch 
lz4-libs-1.8.3-3.el8_4.ppc64le make-4.2.1-11.el8.ppc64le mpfr-3.1.6-1.el8.ppc64le ncurses-6.1-10.20180224.el8.ppc64le ncurses-base-6.1-10.20180224.el8.noarch ncurses-libs-6.1-10.20180224.el8.ppc64le nettle-3.4.1-7.el8.ppc64le npth-1.5-4.el8.ppc64le ocaml-srpm-macros-5-4.el8.noarch openblas-srpm-macros-2-2.el8.noarch openldap-2.4.46-18.el8.ppc64le openssl-libs-1.1.1k-12.el8_9.ppc64le p11-kit-0.23.22-1.el8.ppc64le p11-kit-trust-0.23.22-1.el8.ppc64le pam-1.3.1-27.el8.ppc64le patch-2.7.6-11.el8.ppc64le pcre-8.42-6.el8.ppc64le pcre2-10.32-3.el8_6.ppc64le perl-srpm-macros-1-25.el8.noarch pkgconf-1.4.2-1.el8.ppc64le pkgconf-m4-1.4.2-1.el8.noarch pkgconf-pkg-config-1.4.2-1.el8.ppc64le platform-python-3.6.8-56.el8_9.3.ppc64le platform-python-setuptools-39.2.0-7.el8.noarch popt-1.18-1.el8.ppc64le publicsuffix-list-dafsa-20180723-1.el8.noarch python-rpm-macros-3-45.el8.noarch python-srpm-macros-3-45.el8.noarch python3-libs-3.6.8-56.el8_9.3.ppc64le python3-pip-wheel-9.0.3-23.el8_9.1.noarch python3-rpm-macros-3-45.el8.noarch python3-setuptools-wheel-39.2.0-7.el8.noarch qt5-srpm-macros-5.15.3-1.el8.noarch readline-7.0-10.el8.ppc64le redhat-release-8.9-0.1.el8.ppc64le redhat-rpm-config-131-1.el8.noarch rpm-4.14.3-28.el8_9.ppc64le rpm-build-4.14.3-28.el8_9.ppc64le rpm-build-libs-4.14.3-28.el8_9.ppc64le rpm-libs-4.14.3-28.el8_9.ppc64le rust-srpm-macros-5-2.el8.noarch sed-4.5-5.el8.ppc64le setup-2.12.2-9.el8.noarch shadow-utils-4.6-19.el8.ppc64le sqlite-libs-3.26.0-19.el8_9.ppc64le systemd-libs-239-78.el8.ppc64le tar-1.30-9.el8.ppc64le tpm2-tss-2.3.2-5.el8.ppc64le tzdata-2024a-1.el8.noarch unzip-6.0-46.el8.ppc64le util-linux-2.32.1-44.el8_9.1.ppc64le which-2.21-20.el8.ppc64le xz-5.2.4-4.el8_6.ppc64le xz-libs-5.2.4-4.el8_6.ppc64le zip-3.0-23.el8.ppc64le zlib-1.2.11-25.el8.ppc64le zstd-1.4.4-1.el8.ppc64le Start: buildsrpm Start: rpmbuild -bs sh: -c: line 0: unexpected EOF while looking for matching `"' sh: -c: line 1: syntax error: unexpected end of file Building target platforms: 
ppc64le Building for target ppc64le Wrote: /builddir/build/SRPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm Finish: rpmbuild -bs cp: preserving permissions for ‘/var/lib/copr-rpmbuild/results/chroot_scan/var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root/var/log’: No such file or directory INFO: chroot_scan: 3 files copied to /var/lib/copr-rpmbuild/results/chroot_scan INFO: /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root/var/log/dnf.log /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root/var/log/dnf.librepo.log /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root/var/log/dnf.rpm.log Finish: buildsrpm INFO: Done(/var/lib/copr-rpmbuild/workspace/workdir-7hta920e/cutlass/cutlass.spec) Config(child) 2 minutes 11 seconds INFO: Results and/or logs in: /var/lib/copr-rpmbuild/results INFO: Cleaning up build root ('cleanup_on_success=True') Start: clean chroot INFO: unmounting tmpfs. Finish: clean chroot INFO: Start(/var/lib/copr-rpmbuild/results/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm) Config(rhel+epel-8-ppc64le) Start(bootstrap): chroot init INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-ppc64le-bootstrap-1713469148.550083/root. INFO: reusing tmpfs at /var/lib/mock/rhel+epel-8-ppc64le-bootstrap-1713469148.550083/root. INFO: calling preinit hooks INFO: enabled root cache INFO: enabled package manager cache Start(bootstrap): cleaning package manager metadata Finish(bootstrap): cleaning package manager metadata Finish(bootstrap): chroot init Start: chroot init INFO: mounting tmpfs at /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root. 
INFO: calling preinit hooks INFO: enabled root cache Start: unpacking root cache Finish: unpacking root cache INFO: enabled package manager cache Start: cleaning package manager metadata Finish: cleaning package manager metadata INFO: enabled HW Info plugin INFO: Buildroot is handled by package management downloaded with a bootstrap image: rpm-4.14.3-28.el8_9.ppc64le python3-dnf-4.7.0-19.el8.noarch python3-dnf-plugins-core-4.0.21-23.el8.noarch yum-4.7.0-19.el8.noarch Finish: chroot init Start: build phase for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm Start: build setup for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm sh: -c: line 0: unexpected EOF while looking for matching `"' sh: -c: line 1: syntax error: unexpected end of file Building target platforms: ppc64le Building for target ppc64le Wrote: /builddir/build/SRPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm No matches found for the following disable plugin patterns: local, spacewalk, versionlock Updating Subscription Management repositories. Unable to read consumer identity This system is not registered with an entitlement server. You can use subscription-manager to register. Copr repository 24 kB/s | 1.8 kB 00:00 Additional repo copr_rezso_CUDA 41 kB/s | 1.8 kB 00:00 Additional repo http_developer_download_nvidia_ 37 kB/s | 3.5 kB 00:00 Additional repo http_developer_download_nvidia_ 18 kB/s | 3.5 kB 00:00 Additional repo http_developer_download_nvidia_ 56 kB/s | 3.5 kB 00:00 Red Hat Enterprise Linux - BaseOS 30 kB/s | 4.1 kB 00:00 Red Hat Enterprise Linux - AppStream 27 kB/s | 4.5 kB 00:00 Red Hat Enterprise Linux - CodeReady Linux Buil 29 kB/s | 4.5 kB 00:00 Extra Packages for Enterprise Linux 8 - ppc64le 91 kB/s | 17 kB 00:00 Modular dependency problems: Problem 1: nothing provides requested module(nvidia-driver:latest-dkms:20240416083839) Problem 2: nothing provides requested module(nvidia-driver:latest-dkms:20240416084055) Package gcc-c++-8.5.0-20.el8.ppc64le is already installed. 
Dependencies resolved. =================================================================================================================================================================== Package Arch Version Repository Size =================================================================================================================================================================== Installing: cmake ppc64le 3.26.5-1.el8_9 rhel-appstream 13 M cuda-cudart-devel-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 2.0 M cuda-driver-devel-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 37 k cuda-nvcc-12-4 ppc64le 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 67 M cuda-nvml-devel-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 220 k cuda-nvrtc-devel-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 27 M cuda-nvtx-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 88 k doxygen ppc64le 1:1.8.14-12.el8 codeready-builder 3.8 M git ppc64le 2.39.3-1.el8_8 rhel-appstream 104 k graphviz ppc64le 2.40.1-44.el8 rhel-appstream 2.1 M libcublas-devel-12-4 ppc64le 12.4.5.8-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 278 M libcudnn8 ppc64le 8.9.7.29-2.cuda12.3 copr_rezso_CUDA 467 M libcudnn8-devel ppc64le 8.9.7.29-2.cuda12.3 copr_rezso_CUDA 35 k libcurand-devel-12-4 ppc64le 10.3.5.147-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 53 M python3-setuptools noarch 39.2.0-7.el8 rhel-baseos 163 k python36-devel ppc64le 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 17 k Installing dependencies: adobe-mappings-cmap noarch 20171205-3.el8 rhel-appstream 2.1 M adobe-mappings-cmap-deprecated noarch 20171205-3.el8 rhel-appstream 119 k adobe-mappings-pdf noarch 20180407-1.el8 rhel-appstream 707 k atk 
ppc64le 2.28.1-1.el8 rhel-appstream 275 k avahi-libs ppc64le 0.7-21.el8_9.1 rhel-baseos 67 k cairo ppc64le 1.15.12-6.el8 rhel-appstream 775 k cmake-data noarch 3.26.5-1.el8_9 rhel-appstream 1.9 M cmake-filesystem ppc64le 3.26.5-1.el8_9 rhel-appstream 45 k cmake-rpm-macros noarch 3.26.5-1.el8_9 rhel-appstream 44 k cuda-cccl-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 1.9 M cuda-crt-12-4 ppc64le 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 112 k cuda-cudart-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 220 k cuda-nvrtc-12-4 ppc64le 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 23 M cuda-nvvm-12-4 ppc64le 12.4.131-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 26 M cuda-toolkit-12-4-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.7 k cuda-toolkit-12-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.9 k cuda-toolkit-config-common noarch 12.4.127-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_x86_64 7.9 k cups-libs ppc64le 1:2.2.6-54.el8_9 rhel-baseos 491 k dbus-libs ppc64le 1:1.12.8-26.el8 rhel-baseos 199 k emacs-filesystem noarch 1:26.1-11.el8 rhel-baseos 70 k fontconfig ppc64le 2.13.1-4.el8 rhel-baseos 295 k fontpackages-filesystem noarch 1.44-22.el8 rhel-baseos 16 k freetype ppc64le 2.9.1-9.el8 rhel-baseos 430 k fribidi ppc64le 1.0.4-9.el8 rhel-appstream 92 k gd ppc64le 2.2.5-7.el8 rhel-appstream 156 k gdk-pixbuf2 ppc64le 2.36.12-5.el8 rhel-baseos 472 k gdk-pixbuf2-modules ppc64le 2.36.12-5.el8 rhel-appstream 116 k git-core ppc64le 2.39.3-1.el8_8 rhel-appstream 12 M git-core-doc noarch 2.39.3-1.el8_8 rhel-appstream 3.0 M google-droid-sans-fonts noarch 20120715-13.el8 rhel-appstream 2.5 M graphite2 ppc64le 1.3.10-10.el8 rhel-appstream 129 k groff-base ppc64le 
1.22.3-18.el8 rhel-baseos 1.0 M gtk-update-icon-cache ppc64le 3.22.30-11.el8 rhel-appstream 34 k gtk2 ppc64le 2.24.32-5.el8 rhel-appstream 3.6 M harfbuzz ppc64le 1.7.5-3.el8 rhel-appstream 313 k hicolor-icon-theme noarch 0.17-2.el8 rhel-appstream 48 k jasper-libs ppc64le 2.0.14-5.el8 rhel-appstream 180 k jbig2dec-libs ppc64le 0.16-1.el8 rhel-appstream 75 k jbigkit-libs ppc64le 2.1-14.el8 rhel-appstream 57 k lcms2 ppc64le 2.9-2.el8 rhel-appstream 182 k less ppc64le 530-2.el8_9 rhel-baseos 175 k libICE ppc64le 1.0.9-15.el8 rhel-appstream 78 k libSM ppc64le 1.2.3-1.el8 rhel-appstream 48 k libX11 ppc64le 1.6.8-6.el8 rhel-appstream 651 k libX11-common noarch 1.6.8-6.el8 rhel-appstream 158 k libXau ppc64le 1.0.9-3.el8 rhel-appstream 38 k libXaw ppc64le 1.0.13-10.el8 rhel-appstream 194 k libXcomposite ppc64le 0.4.4-14.el8 rhel-appstream 29 k libXcursor ppc64le 1.1.15-3.el8 rhel-appstream 39 k libXdamage ppc64le 1.1.4-14.el8 rhel-appstream 27 k libXext ppc64le 1.3.4-1.el8 rhel-appstream 47 k libXfixes ppc64le 5.0.3-7.el8 rhel-appstream 25 k libXft ppc64le 2.3.3-1.el8 rhel-appstream 71 k libXi ppc64le 1.7.10-1.el8 rhel-appstream 50 k libXinerama ppc64le 1.1.4-1.el8 rhel-appstream 16 k libXmu ppc64le 1.1.3-1.el8 rhel-appstream 81 k libXpm ppc64le 3.5.12-9.el8_7 rhel-appstream 63 k libXrandr ppc64le 1.5.2-1.el8 rhel-appstream 34 k libXrender ppc64le 0.9.10-7.el8 rhel-appstream 34 k libXt ppc64le 1.1.5-12.el8 rhel-appstream 194 k libXxf86misc ppc64le 1.0.4-1.el8 rhel-appstream 23 k libXxf86vm ppc64le 1.1.4-9.el8 rhel-appstream 20 k libcroco ppc64le 0.6.12-4.el8_2.1 rhel-baseos 123 k libcublas-12-4 ppc64le 12.4.5.8-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 240 M libcurand-12-4 ppc64le 10.3.5.147-1 http_developer_download_nvidia_com_compute_cuda_repos_rhel8_ppc64le 53 M libdatrie ppc64le 0.2.9-7.el8 rhel-appstream 35 k libedit ppc64le 3.1-23.20170329cvs.el8 rhel-baseos 110 k libfontenc ppc64le 1.1.3-8.el8 rhel-appstream 37 k libgs ppc64le 9.27-11.el8 
rhel-appstream 3.2 M libidn ppc64le 1.34-5.el8 rhel-appstream 241 k libijs ppc64le 0.35-5.el8 rhel-appstream 31 k libjpeg-turbo ppc64le 1.5.3-12.el8 rhel-appstream 171 k libmcpp ppc64le 2.7.2-20.el8 rhel-appstream 89 k libpaper ppc64le 1.1.24-22.el8 rhel-appstream 45 k libpng ppc64le 2:1.6.34-5.el8 rhel-baseos 140 k librsvg2 ppc64le 2.42.7-5.el8 rhel-appstream 586 k libthai ppc64le 0.1.27-2.el8 rhel-appstream 204 k libtiff ppc64le 4.0.9-29.el8_8 rhel-appstream 200 k libuv ppc64le 1:1.41.1-1.el8_4 rhel-appstream 162 k libwebp ppc64le 1.0.0-9.el8_9.1 rhel-appstream 255 k libxcb ppc64le 1.13.1-1.el8 rhel-appstream 238 k mcpp ppc64le 2.7.2-20.el8 rhel-appstream 32 k openjpeg2 ppc64le 2.4.0-5.el8 rhel-appstream 181 k openssh ppc64le 8.0p1-19.el8_9.2 rhel-baseos 523 k openssh-clients ppc64le 8.0p1-19.el8_9.2 rhel-baseos 689 k openssl ppc64le 1:1.1.1k-12.el8_9 rhel-baseos 714 k pango ppc64le 1.42.4-8.el8 rhel-appstream 314 k perl-Carp noarch 1.42-396.el8 rhel-baseos 30 k perl-Data-Dumper ppc64le 2.167-399.el8 rhel-baseos 59 k perl-Digest noarch 1.17-395.el8 rhel-baseos 27 k perl-Digest-MD5 ppc64le 2.55-396.el8 rhel-baseos 38 k perl-Encode ppc64le 4:2.97-3.el8 rhel-baseos 1.5 M perl-Errno ppc64le 1.28-422.el8 rhel-baseos 77 k perl-Error noarch 1:0.17025-2.el8 rhel-appstream 46 k perl-Exporter noarch 5.72-396.el8 rhel-baseos 34 k perl-File-Path noarch 2.15-2.el8 rhel-baseos 38 k perl-File-Temp noarch 0.230.600-1.el8 rhel-baseos 63 k perl-Getopt-Long noarch 1:2.50-4.el8 rhel-baseos 63 k perl-Git noarch 2.39.3-1.el8_8 rhel-appstream 79 k perl-HTTP-Tiny noarch 0.074-2.el8_9.1 rhel-baseos 59 k perl-IO ppc64le 1.38-422.el8 rhel-baseos 143 k perl-IO-Socket-IP noarch 0.39-5.el8 rhel-baseos 47 k perl-IO-Socket-SSL noarch 2.066-4.module+el8.3.0+6446+594cad75 rhel-appstream 298 k perl-MIME-Base64 ppc64le 3.15-396.el8 rhel-baseos 31 k perl-Mozilla-CA noarch 20160104-7.module+el8.3.0+6498+9eecfe51 rhel-appstream 15 k perl-Net-SSLeay ppc64le 1.88-2.module+el8.6.0+13392+f0897f98 
rhel-appstream 382 k perl-PathTools ppc64le 3.74-1.el8 rhel-baseos 91 k perl-Pod-Escapes noarch 1:1.07-395.el8 rhel-baseos 20 k perl-Pod-Perldoc noarch 3.28-396.el8 rhel-baseos 88 k perl-Pod-Simple noarch 1:3.35-395.el8 rhel-baseos 213 k perl-Pod-Usage noarch 4:1.69-395.el8 rhel-baseos 34 k perl-Scalar-List-Utils ppc64le 3:1.49-2.el8 rhel-baseos 71 k perl-Socket ppc64le 4:2.027-3.el8 rhel-baseos 59 k perl-Storable ppc64le 1:3.11-3.el8 rhel-baseos 100 k perl-Term-ANSIColor noarch 4.06-396.el8 rhel-baseos 46 k perl-Term-Cap noarch 1.17-395.el8 rhel-baseos 23 k perl-TermReadKey ppc64le 2.37-7.el8 rhel-appstream 42 k perl-Text-ParseWords noarch 3.30-395.el8 rhel-baseos 18 k perl-Text-Tabs+Wrap noarch 2013.0523-395.el8 rhel-baseos 24 k perl-Time-Local noarch 1:1.280-1.el8 rhel-baseos 34 k perl-URI noarch 1.73-3.el8 rhel-baseos 116 k perl-Unicode-Normalize ppc64le 1.25-396.el8 rhel-baseos 80 k perl-constant noarch 1.33-396.el8 rhel-baseos 25 k perl-interpreter ppc64le 4:5.26.3-422.el8 rhel-baseos 6.3 M perl-libnet noarch 3.11-3.el8 rhel-baseos 121 k perl-libs ppc64le 4:5.26.3-422.el8 rhel-baseos 1.6 M perl-macros ppc64le 4:5.26.3-422.el8 rhel-baseos 73 k perl-parent noarch 1:0.237-1.el8 rhel-baseos 20 k perl-podlators noarch 4.11-1.el8 rhel-baseos 118 k perl-threads ppc64le 1:2.21-2.el8 rhel-baseos 62 k perl-threads-shared ppc64le 1.58-2.el8 rhel-baseos 49 k pixman ppc64le 0.38.4-3.el8_9 rhel-appstream 201 k platform-python-devel ppc64le 3.6.8-56.el8_9.3 rhel-appstream 241 k platform-python-pip noarch 9.0.3-23.el8_9.1 rhel-baseos 1.6 M python3-pip noarch 9.0.3-23.el8_9.1 rhel-appstream 20 k python3-rpm-generators noarch 5-8.el8 rhel-appstream 25 k python36 ppc64le 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 19 k python36-rpm-macros noarch 3.6.8-38.module+el8.9.0+20976+d3c38525 rhel-appstream 16 k shared-mime-info ppc64le 1.9-3.el8 rhel-baseos 332 k urw-base35-bookman-fonts noarch 20170801-10.el8 rhel-appstream 857 k urw-base35-c059-fonts noarch 20170801-10.el8 
rhel-appstream 884 k urw-base35-d050000l-fonts noarch 20170801-10.el8 rhel-appstream 79 k urw-base35-fonts noarch 20170801-10.el8 rhel-appstream 12 k urw-base35-fonts-common noarch 20170801-10.el8 rhel-appstream 23 k urw-base35-gothic-fonts noarch 20170801-10.el8 rhel-appstream 654 k urw-base35-nimbus-mono-ps-fonts noarch 20170801-10.el8 rhel-appstream 801 k urw-base35-nimbus-roman-fonts noarch 20170801-10.el8 rhel-appstream 865 k urw-base35-nimbus-sans-fonts noarch 20170801-10.el8 rhel-appstream 1.3 M urw-base35-p052-fonts noarch 20170801-10.el8 rhel-appstream 982 k urw-base35-standard-symbols-ps-fonts noarch 20170801-10.el8 rhel-appstream 44 k urw-base35-z003-fonts noarch 20170801-10.el8 rhel-appstream 279 k vim-filesystem noarch 2:8.0.1763-19.el8_6.4 rhel-appstream 50 k xorg-x11-font-utils ppc64le 1:7.5-41.el8 rhel-appstream 113 k xorg-x11-fonts-ISO8859-1-100dpi noarch 7.5-19.el8 rhel-appstream 1.1 M xorg-x11-server-utils ppc64le 7.7-27.el8 rhel-appstream 212 k Enabling module streams: perl 5.26 perl-IO-Socket-SSL 2.066 perl-libwww-perl 6.34 python36 3.6 Transaction Summary =================================================================================================================================================================== Install 171 Packages Total download size: 1.3 G Installed size: 3.0 G Downloading Packages: (1/171): cuda-toolkit-12-4-config-common-12.4.1 104 kB/s | 7.7 kB 00:00 (2/171): cuda-toolkit-12-config-common-12.4.127 225 kB/s | 7.9 kB 00:00 (3/171): cuda-toolkit-config-common-12.4.127-1. 
242 kB/s | 7.9 kB 00:00 (4/171): libcudnn8-devel-8.9.7.29-2.cuda12.3.pp 144 kB/s | 35 kB 00:00 (5/171): cuda-cccl-12-4-12.4.127-1.ppc64le.rpm 8.2 MB/s | 1.9 MB 00:00 (6/171): cuda-cudart-12-4-12.4.127-1.ppc64le.rp 5.2 MB/s | 220 kB 00:00 (7/171): cuda-crt-12-4-12.4.131-1.ppc64le.rpm 624 kB/s | 112 kB 00:00 (8/171): cuda-driver-devel-12-4-12.4.127-1.ppc6 939 kB/s | 37 kB 00:00 (9/171): cuda-cudart-devel-12-4-12.4.127-1.ppc6 21 MB/s | 2.0 MB 00:00 (10/171): cuda-nvml-devel-12-4-12.4.127-1.ppc64 5.4 MB/s | 220 kB 00:00 (11/171): cuda-nvrtc-12-4-12.4.127-1.ppc64le.rp 33 MB/s | 23 MB 00:00 (12/171): cuda-nvrtc-devel-12-4-12.4.127-1.ppc6 48 MB/s | 27 MB 00:00 (13/171): cuda-nvtx-12-4-12.4.127-1.ppc64le.rpm 4.0 MB/s | 88 kB 00:00 (14/171): cuda-nvvm-12-4-12.4.131-1.ppc64le.rpm 47 MB/s | 26 MB 00:00 (15/171): libcudnn8-8.9.7.29-2.cuda12.3.ppc64le 48 MB/s | 467 MB 00:09 (16/171): cuda-nvcc-12-4-12.4.131-1.ppc64le.rpm 5.1 MB/s | 67 MB 00:13 (17/171): libcublas-12-4-12.4.5.8-1.ppc64le.rpm 12 MB/s | 240 MB 00:19 (18/171): libcurand-12-4-10.3.5.147-1.ppc64le.r 3.9 MB/s | 53 MB 00:13 (19/171): groff-base-1.22.3-18.el8.ppc64le.rpm 3.2 MB/s | 1.0 MB 00:00 (20/171): libedit-3.1-23.20170329cvs.el8.ppc64l 125 kB/s | 110 kB 00:00 (21/171): libpng-1.6.34-5.el8.ppc64le.rpm 1.1 MB/s | 140 kB 00:00 (22/171): perl-Data-Dumper-2.167-399.el8.ppc64l 325 kB/s | 59 kB 00:00 (23/171): perl-Encode-2.97-3.el8.ppc64le.rpm 10 MB/s | 1.5 MB 00:00 (24/171): perl-MIME-Base64-3.15-396.el8.ppc64le 225 kB/s | 31 kB 00:00 (25/171): perl-PathTools-3.74-1.el8.ppc64le.rpm 1.8 MB/s | 91 kB 00:00 (26/171): perl-Scalar-List-Utils-1.49-2.el8.ppc 1.2 MB/s | 71 kB 00:00 (27/171): perl-Storable-3.11-3.el8.ppc64le.rpm 1.7 MB/s | 100 kB 00:00 (28/171): perl-Unicode-Normalize-1.25-396.el8.p 1.1 MB/s | 80 kB 00:00 (29/171): perl-threads-2.21-2.el8.ppc64le.rpm 1.1 MB/s | 62 kB 00:00 (30/171): perl-threads-shared-1.58-2.el8.ppc64l 1.0 MB/s | 49 kB 00:00 (31/171): shared-mime-info-1.9-3.el8.ppc64le.rp 5.1 MB/s | 332 kB 
00:00 (32/171): fontpackages-filesystem-1.44-22.el8.n 186 kB/s | 16 kB 00:00 (33/171): perl-Carp-1.42-396.el8.noarch.rpm 642 kB/s | 30 kB 00:00 (34/171): perl-Exporter-5.72-396.el8.noarch.rpm 634 kB/s | 34 kB 00:00 (35/171): perl-File-Path-2.15-2.el8.noarch.rpm 725 kB/s | 38 kB 00:00 (36/171): perl-File-Temp-0.230.600-1.el8.noarch 906 kB/s | 63 kB 00:00 (37/171): perl-Getopt-Long-2.50-4.el8.noarch.rp 1.0 MB/s | 63 kB 00:00 (38/171): perl-Pod-Escapes-1.07-395.el8.noarch. 383 kB/s | 20 kB 00:00 (39/171): perl-Pod-Perldoc-3.28-396.el8.noarch. 1.3 MB/s | 88 kB 00:00 (40/171): perl-Pod-Simple-3.35-395.el8.noarch.r 3.8 MB/s | 213 kB 00:00 (41/171): perl-Pod-Usage-1.69-395.el8.noarch.rp 188 kB/s | 34 kB 00:00 (42/171): perl-Socket-2.027-3.el8.ppc64le.rpm 936 kB/s | 59 kB 00:00 (43/171): perl-Term-ANSIColor-4.06-396.el8.noar 962 kB/s | 46 kB 00:00 (44/171): perl-Term-Cap-1.17-395.el8.noarch.rpm 278 kB/s | 23 kB 00:00 (45/171): perl-Text-ParseWords-3.30-395.el8.noa 326 kB/s | 18 kB 00:00 (46/171): perl-Text-Tabs+Wrap-2013.0523-395.el8 479 kB/s | 24 kB 00:00 (47/171): perl-Time-Local-1.280-1.el8.noarch.rp 709 kB/s | 34 kB 00:00 (48/171): perl-constant-1.33-396.el8.noarch.rpm 413 kB/s | 25 kB 00:00 (49/171): perl-parent-0.237-1.el8.noarch.rpm 335 kB/s | 20 kB 00:00 (50/171): perl-podlators-4.11-1.el8.noarch.rpm 1.3 MB/s | 118 kB 00:00 (51/171): gdk-pixbuf2-2.36.12-5.el8.ppc64le.rpm 4.7 MB/s | 472 kB 00:00 (52/171): libcroco-0.6.12-4.el8_2.1.ppc64le.rpm 1.6 MB/s | 123 kB 00:00 (53/171): fontconfig-2.13.1-4.el8.ppc64le.rpm 3.7 MB/s | 295 kB 00:00 (54/171): freetype-2.9.1-9.el8.ppc64le.rpm 4.5 MB/s | 430 kB 00:00 (55/171): perl-IO-1.38-422.el8.ppc64le.rpm 2.1 MB/s | 143 kB 00:00 (56/171): perl-interpreter-5.26.3-422.el8.ppc64 25 MB/s | 6.3 MB 00:00 (57/171): perl-libs-5.26.3-422.el8.ppc64le.rpm 16 MB/s | 1.6 MB 00:00 (58/171): perl-macros-5.26.3-422.el8.ppc64le.rp 1.2 MB/s | 73 kB 00:00 (59/171): emacs-filesystem-26.1-11.el8.noarch.r 1.3 MB/s | 70 kB 00:00 (60/171): 
perl-Errno-1.28-422.el8.ppc64le.rpm 1.4 MB/s | 77 kB 00:00 (61/171): perl-URI-1.73-3.el8.noarch.rpm 1.6 MB/s | 116 kB 00:00 (62/171): python3-setuptools-39.2.0-7.el8.noarc 330 kB/s | 163 kB 00:00 (63/171): avahi-libs-0.7-21.el8_9.1.ppc64le.rpm 1.1 MB/s | 67 kB 00:00 (64/171): cups-libs-2.2.6-54.el8_9.ppc64le.rpm 7.8 MB/s | 491 kB 00:00 (65/171): dbus-libs-1.12.8-26.el8.ppc64le.rpm 2.6 MB/s | 199 kB 00:00 (66/171): openssl-1.1.1k-12.el8_9.ppc64le.rpm 6.4 MB/s | 714 kB 00:00 (67/171): perl-Digest-1.17-395.el8.noarch.rpm 569 kB/s | 27 kB 00:00 (68/171): perl-Digest-MD5-2.55-396.el8.ppc64le. 644 kB/s | 38 kB 00:00 (69/171): perl-IO-Socket-IP-0.39-5.el8.noarch.r 964 kB/s | 47 kB 00:00 (70/171): perl-libnet-3.11-3.el8.noarch.rpm 1.9 MB/s | 121 kB 00:00 (71/171): openssh-8.0p1-19.el8_9.2.ppc64le.rpm 7.8 MB/s | 523 kB 00:00 (72/171): openssh-clients-8.0p1-19.el8_9.2.ppc6 11 MB/s | 689 kB 00:00 (73/171): less-530-2.el8_9.ppc64le.rpm 3.2 MB/s | 175 kB 00:00 (74/171): perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 1.2 MB/s | 59 kB 00:00 (75/171): platform-python-pip-9.0.3-23.el8_9.1. 
19 MB/s | 1.6 MB 00:00 (76/171): atk-2.28.1-1.el8.ppc64le.rpm 539 kB/s | 275 kB 00:00 (77/171): graphite2-1.3.10-10.el8.ppc64le.rpm 2.5 MB/s | 129 kB 00:00 (78/171): harfbuzz-1.7.5-3.el8.ppc64le.rpm 5.6 MB/s | 313 kB 00:00 (79/171): lcms2-2.9-2.el8.ppc64le.rpm 3.5 MB/s | 182 kB 00:00 (80/171): libXcursor-1.1.15-3.el8.ppc64le.rpm 790 kB/s | 39 kB 00:00 (81/171): libXdamage-1.1.4-14.el8.ppc64le.rpm 512 kB/s | 27 kB 00:00 (82/171): libXinerama-1.1.4-1.el8.ppc64le.rpm 323 kB/s | 16 kB 00:00 (83/171): libXxf86misc-1.0.4-1.el8.ppc64le.rpm 58 kB/s | 23 kB 00:00 (84/171): libdatrie-0.2.9-7.el8.ppc64le.rpm 664 kB/s | 35 kB 00:00 (85/171): mcpp-2.7.2-20.el8.ppc64le.rpm 42 kB/s | 32 kB 00:00 (86/171): libcublas-devel-12-4-12.4.5.8-1.ppc64 10 MB/s | 278 MB 00:27 (87/171): perl-TermReadKey-2.37-7.el8.ppc64le.r 25 kB/s | 42 kB 00:01 (88/171): jbigkit-libs-2.1-14.el8.ppc64le.rpm 294 kB/s | 57 kB 00:00 (89/171): libSM-1.2.3-1.el8.ppc64le.rpm 70 kB/s | 48 kB 00:00 (90/171): libXcomposite-0.4.4-14.el8.ppc64le.rp 594 kB/s | 29 kB 00:00 (91/171): libXfixes-5.0.3-7.el8.ppc64le.rpm 506 kB/s | 25 kB 00:00 (92/171): libXaw-1.0.13-10.el8.ppc64le.rpm 312 kB/s | 194 kB 00:00 (93/171): libXrender-0.9.10-7.el8.ppc64le.rpm 496 kB/s | 34 kB 00:00 (94/171): libcurand-devel-12-4-10.3.5.147-1.ppc 3.2 MB/s | 53 MB 00:16 (95/171): libfontenc-1.1.3-8.el8.ppc64le.rpm 41 kB/s | 37 kB 00:00 (96/171): libidn-1.34-5.el8.ppc64le.rpm 269 kB/s | 241 kB 00:00 (97/171): libijs-0.35-5.el8.ppc64le.rpm 60 kB/s | 31 kB 00:00 (98/171): libthai-0.1.27-2.el8.ppc64le.rpm 1.1 MB/s | 204 kB 00:00 (99/171): libmcpp-2.7.2-20.el8.ppc64le.rpm 136 kB/s | 89 kB 00:00 (100/171): libpaper-1.1.24-22.el8.ppc64le.rpm 57 kB/s | 45 kB 00:00 (101/171): xorg-x11-server-utils-7.7-27.el8.ppc 231 kB/s | 212 kB 00:00 (102/171): google-droid-sans-fonts-20120715-13. 
4.5 MB/s | 2.5 MB 00:00 (103/171): libXxf86vm-1.1.4-9.el8.ppc64le.rpm 43 kB/s | 20 kB 00:00 (104/171): urw-base35-fonts-20170801-10.el8.noa 16 kB/s | 12 kB 00:00 (105/171): urw-base35-gothic-fonts-20170801-10. 810 kB/s | 654 kB 00:00 (106/171): urw-base35-p052-fonts-20170801-10.el 856 kB/s | 982 kB 00:01 (107/171): adobe-mappings-cmap-deprecated-20171 158 kB/s | 119 kB 00:00 (108/171): hicolor-icon-theme-0.17-2.el8.noarch 991 kB/s | 48 kB 00:00 (109/171): adobe-mappings-cmap-20171205-3.el8.n 2.2 MB/s | 2.1 MB 00:00 (110/171): adobe-mappings-pdf-20180407-1.el8.no 1.3 MB/s | 707 kB 00:00 (111/171): urw-base35-bookman-fonts-20170801-10 1.6 MB/s | 857 kB 00:00 (112/171): perl-Error-0.17025-2.el8.noarch.rpm 70 kB/s | 46 kB 00:00 (113/171): urw-base35-c059-fonts-20170801-10.el 1.0 MB/s | 884 kB 00:00 (114/171): urw-base35-fonts-common-20170801-10. 63 kB/s | 23 kB 00:00 (115/171): urw-base35-nimbus-mono-ps-fonts-2017 2.1 MB/s | 801 kB 00:00 (116/171): urw-base35-d050000l-fonts-20170801-1 94 kB/s | 79 kB 00:00 (117/171): urw-base35-nimbus-roman-fonts-201708 882 kB/s | 865 kB 00:00 (118/171): urw-base35-nimbus-sans-fonts-2017080 1.6 MB/s | 1.3 MB 00:00 (119/171): urw-base35-standard-symbols-ps-fonts 52 kB/s | 44 kB 00:00 (120/171): gdk-pixbuf2-modules-2.36.12-5.el8.pp 2.3 MB/s | 116 kB 00:00 (121/171): xorg-x11-fonts-ISO8859-1-100dpi-7.5- 2.3 MB/s | 1.1 MB 00:00 (122/171): urw-base35-z003-fonts-20170801-10.el 335 kB/s | 279 kB 00:00 (123/171): libxcb-1.13.1-1.el8.ppc64le.rpm 4.4 MB/s | 238 kB 00:00 (124/171): perl-Mozilla-CA-20160104-7.module+el 324 kB/s | 15 kB 00:00 (125/171): libXft-2.3.3-1.el8.ppc64le.rpm 1.4 MB/s | 71 kB 00:00 (126/171): perl-IO-Socket-SSL-2.066-4.module+el 5.6 MB/s | 298 kB 00:00 (127/171): gd-2.2.5-7.el8.ppc64le.rpm 3.1 MB/s | 156 kB 00:00 (128/171): libXau-1.0.9-3.el8.ppc64le.rpm 790 kB/s | 38 kB 00:00 (129/171): libICE-1.0.9-15.el8.ppc64le.rpm 157 kB/s | 78 kB 00:00 (130/171): libXt-1.1.5-12.el8.ppc64le.rpm 219 kB/s | 194 kB 00:00 (131/171): 
libXext-1.3.4-1.el8.ppc64le.rpm 859 kB/s | 47 kB 00:00 (132/171): libXi-1.7.10-1.el8.ppc64le.rpm 1.0 MB/s | 50 kB 00:00 (133/171): libXrandr-1.5.2-1.el8.ppc64le.rpm 715 kB/s | 34 kB 00:00 (134/171): libXmu-1.1.3-1.el8.ppc64le.rpm 160 kB/s | 81 kB 00:00 (135/171): libuv-1.41.1-1.el8_4.ppc64le.rpm 3.1 MB/s | 162 kB 00:00 (136/171): gtk2-2.24.32-5.el8.ppc64le.rpm 6.2 MB/s | 3.6 MB 00:00 (137/171): jasper-libs-2.0.14-5.el8.ppc64le.rpm 3.1 MB/s | 180 kB 00:00 (138/171): libjpeg-turbo-1.5.3-12.el8.ppc64le.r 3.3 MB/s | 171 kB 00:00 (139/171): pango-1.42.4-8.el8.ppc64le.rpm 5.9 MB/s | 314 kB 00:00 (140/171): perl-Net-SSLeay-1.88-2.module+el8.6. 7.3 MB/s | 382 kB 00:00 (141/171): cairo-1.15.12-6.el8.ppc64le.rpm 12 MB/s | 775 kB 00:00 (142/171): vim-filesystem-8.0.1763-19.el8_6.4.n 1.0 MB/s | 50 kB 00:00 (143/171): fribidi-1.0.4-9.el8.ppc64le.rpm 1.8 MB/s | 92 kB 00:00 (144/171): gtk-update-icon-cache-3.22.30-11.el8 548 kB/s | 34 kB 00:00 (145/171): openjpeg2-2.4.0-5.el8.ppc64le.rpm 3.6 MB/s | 181 kB 00:00 (146/171): libXpm-3.5.12-9.el8_7.ppc64le.rpm 1.3 MB/s | 63 kB 00:00 (147/171): jbig2dec-libs-0.16-1.el8.ppc64le.rpm 68 kB/s | 75 kB 00:01 (148/171): xorg-x11-font-utils-7.5-41.el8.ppc64 124 kB/s | 113 kB 00:00 (149/171): graphviz-2.40.1-44.el8.ppc64le.rpm 3.4 MB/s | 2.1 MB 00:00 (150/171): git-2.39.3-1.el8_8.ppc64le.rpm 195 kB/s | 104 kB 00:00 (151/171): git-core-doc-2.39.3-1.el8_8.noarch.r 3.7 MB/s | 3.0 MB 00:00 (152/171): perl-Git-2.39.3-1.el8_8.noarch.rpm 84 kB/s | 79 kB 00:00 (153/171): libtiff-4.0.9-29.el8_8.ppc64le.rpm 3.9 MB/s | 200 kB 00:00 (154/171): libX11-1.6.8-6.el8.ppc64le.rpm 11 MB/s | 651 kB 00:00 (155/171): libX11-common-1.6.8-6.el8.noarch.rpm 3.0 MB/s | 158 kB 00:00 (156/171): git-core-2.39.3-1.el8_8.ppc64le.rpm 9.1 MB/s | 12 MB 00:01 (157/171): python3-rpm-generators-5-8.el8.noarc 37 kB/s | 25 kB 00:00 (158/171): libwebp-1.0.0-9.el8_9.1.ppc64le.rpm 4.9 MB/s | 255 kB 00:00 (159/171): librsvg2-2.42.7-5.el8.ppc64le.rpm 631 kB/s | 586 kB 00:00 (160/171): 
libgs-9.27-11.el8.ppc64le.rpm 2.9 MB/s | 3.2 MB 00:01 (161/171): cmake-data-3.26.5-1.el8_9.noarch.rpm 25 MB/s | 1.9 MB 00:00 (162/171): cmake-filesystem-3.26.5-1.el8_9.ppc6 927 kB/s | 45 kB 00:00 (163/171): cmake-rpm-macros-3.26.5-1.el8_9.noar 934 kB/s | 44 kB 00:00 (164/171): cmake-3.26.5-1.el8_9.ppc64le.rpm 16 MB/s | 13 MB 00:00 (165/171): pixman-0.38.4-3.el8_9.ppc64le.rpm 2.2 MB/s | 201 kB 00:00 (166/171): python36-3.6.8-38.module+el8.9.0+209 397 kB/s | 19 kB 00:00 (167/171): python36-devel-3.6.8-38.module+el8.9 288 kB/s | 17 kB 00:00 (168/171): python3-pip-9.0.3-23.el8_9.1.noarch. 417 kB/s | 20 kB 00:00 (169/171): platform-python-devel-3.6.8-56.el8_9 472 kB/s | 241 kB 00:00 (170/171): python36-rpm-macros-3.6.8-38.module+ 21 kB/s | 16 kB 00:00 (171/171): doxygen-1.8.14-12.el8.ppc64le.rpm 3.9 MB/s | 3.8 MB 00:00 -------------------------------------------------------------------------------- Total 27 MB/s | 1.3 GB 00:49 Running transaction check Transaction check succeeded. Running transaction test Transaction test succeeded. 
Running transaction Preparing : 1/1 Installing : libpng-2:1.6.34-5.el8.ppc64le 1/171 Installing : freetype-2.9.1-9.el8.ppc64le 2/171 Installing : libjpeg-turbo-1.5.3-12.el8.ppc64le 3/171 Installing : libICE-1.0.9-15.el8.ppc64le 4/171 Installing : emacs-filesystem-1:26.1-11.el8.noarch 5/171 Installing : fontpackages-filesystem-1.44-22.el8.noarch 6/171 Installing : urw-base35-fonts-common-20170801-10.el8.noarch 7/171 Installing : cuda-toolkit-config-common-12.4.127-1.noarch 8/171 Installing : cuda-toolkit-12-config-common-12.4.127-1.noarch 9/171 Installing : cuda-toolkit-12-4-config-common-12.4.127-1.noarc 10/171 Installing : google-droid-sans-fonts-20120715-13.el8.noarch 11/171 Installing : fontconfig-2.13.1-4.el8.ppc64le 12/171 Running scriptlet: fontconfig-2.13.1-4.el8.ppc64le 12/171 Installing : libSM-1.2.3-1.el8.ppc64le 13/171 Installing : cmake-rpm-macros-3.26.5-1.el8_9.noarch 14/171 Installing : cmake-filesystem-3.26.5-1.el8_9.ppc64le 15/171 Installing : adobe-mappings-cmap-20171205-3.el8.noarch 16/171 Installing : atk-2.28.1-1.el8.ppc64le 17/171 Installing : adobe-mappings-cmap-deprecated-20171205-3.el8.no 18/171 Installing : cuda-cudart-12-4-12.4.127-1.ppc64le 19/171 Running scriptlet: cuda-cudart-12-4-12.4.127-1.ppc64le 19/171 Installing : libcublas-12-4-12.4.5.8-1.ppc64le 20/171 Running scriptlet: libcublas-12-4-12.4.5.8-1.ppc64le 20/171 Installing : libcurand-12-4-10.3.5.147-1.ppc64le 21/171 Running scriptlet: libcurand-12-4-10.3.5.147-1.ppc64le 21/171 Installing : libidn-1.34-5.el8.ppc64le 22/171 Running scriptlet: libidn-1.34-5.el8.ppc64le 22/171 Installing : jasper-libs-2.0.14-5.el8.ppc64le 23/171 Installing : pixman-0.38.4-3.el8_9.ppc64le 24/171 Installing : libwebp-1.0.0-9.el8_9.1.ppc64le 25/171 Installing : libX11-common-1.6.8-6.el8.noarch 26/171 Installing : python3-rpm-generators-5-8.el8.noarch 27/171 Installing : platform-python-devel-3.6.8-56.el8_9.3.ppc64le 28/171 Installing : openjpeg2-2.4.0-5.el8.ppc64le 29/171 Installing : 
fribidi-1.0.4-9.el8.ppc64le 30/171 Installing : vim-filesystem-2:8.0.1763-19.el8_6.4.noarch 31/171 Installing : libuv-1:1.41.1-1.el8_4.ppc64le 32/171 Installing : cmake-3.26.5-1.el8_9.ppc64le 33/171 Installing : cmake-data-3.26.5-1.el8_9.noarch 34/171 Installing : jbig2dec-libs-0.16-1.el8.ppc64le 35/171 Running scriptlet: jbig2dec-libs-0.16-1.el8.ppc64le 35/171 Installing : libXau-1.0.9-3.el8.ppc64le 36/171 Installing : libxcb-1.13.1-1.el8.ppc64le 37/171 Installing : libX11-1.6.8-6.el8.ppc64le 38/171 Installing : libXext-1.3.4-1.el8.ppc64le 39/171 Installing : libXrender-0.9.10-7.el8.ppc64le 40/171 Installing : cairo-1.15.12-6.el8.ppc64le 41/171 Installing : libXt-1.1.5-12.el8.ppc64le 42/171 Installing : libXmu-1.1.3-1.el8.ppc64le 43/171 Installing : libXfixes-5.0.3-7.el8.ppc64le 44/171 Installing : libXpm-3.5.12-9.el8_7.ppc64le 45/171 Installing : libXcursor-1.1.15-3.el8.ppc64le 46/171 Installing : libXrandr-1.5.2-1.el8.ppc64le 47/171 Installing : libXinerama-1.1.4-1.el8.ppc64le 48/171 Installing : libXi-1.7.10-1.el8.ppc64le 49/171 Installing : libXaw-1.0.13-10.el8.ppc64le 50/171 Installing : libXdamage-1.1.4-14.el8.ppc64le 51/171 Installing : libXft-2.3.3-1.el8.ppc64le 52/171 Installing : libXxf86misc-1.0.4-1.el8.ppc64le 53/171 Installing : libXxf86vm-1.1.4-9.el8.ppc64le 54/171 Installing : libXcomposite-0.4.4-14.el8.ppc64le 55/171 Installing : hicolor-icon-theme-0.17-2.el8.noarch 56/171 Installing : adobe-mappings-pdf-20180407-1.el8.noarch 57/171 Installing : libpaper-1.1.24-22.el8.ppc64le 58/171 Installing : libmcpp-2.7.2-20.el8.ppc64le 59/171 Running scriptlet: libmcpp-2.7.2-20.el8.ppc64le 59/171 Installing : mcpp-2.7.2-20.el8.ppc64le 60/171 Installing : xorg-x11-server-utils-7.7-27.el8.ppc64le 61/171 Installing : libijs-0.35-5.el8.ppc64le 62/171 Installing : libfontenc-1.1.3-8.el8.ppc64le 63/171 Installing : xorg-x11-font-utils-1:7.5-41.el8.ppc64le 64/171 Installing : urw-base35-gothic-fonts-20170801-10.el8.noarch 65/171 Running scriptlet: 
urw-base35-gothic-fonts-20170801-10.el8.noarch 65/171 Installing : urw-base35-p052-fonts-20170801-10.el8.noarch 66/171 Running scriptlet: urw-base35-p052-fonts-20170801-10.el8.noarch 66/171 Installing : urw-base35-bookman-fonts-20170801-10.el8.noarch 67/171 Running scriptlet: urw-base35-bookman-fonts-20170801-10.el8.noarch 67/171 Installing : urw-base35-c059-fonts-20170801-10.el8.noarch 68/171 Running scriptlet: urw-base35-c059-fonts-20170801-10.el8.noarch 68/171 Installing : urw-base35-d050000l-fonts-20170801-10.el8.noarch 69/171 Running scriptlet: urw-base35-d050000l-fonts-20170801-10.el8.noarch 69/171 Installing : urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 70/171 Running scriptlet: urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 70/171 Installing : urw-base35-nimbus-roman-fonts-20170801-10.el8.no 71/171 Running scriptlet: urw-base35-nimbus-roman-fonts-20170801-10.el8.no 71/171 Installing : urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 72/171 Running scriptlet: urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 72/171 Installing : urw-base35-standard-symbols-ps-fonts-20170801-10 73/171 Running scriptlet: urw-base35-standard-symbols-ps-fonts-20170801-10 73/171 Installing : urw-base35-z003-fonts-20170801-10.el8.noarch 74/171 Running scriptlet: urw-base35-z003-fonts-20170801-10.el8.noarch 74/171 Installing : urw-base35-fonts-20170801-10.el8.noarch 75/171 Installing : xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 76/171 Running scriptlet: xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 76/171 Installing : jbigkit-libs-2.1-14.el8.ppc64le 77/171 Running scriptlet: jbigkit-libs-2.1-14.el8.ppc64le 77/171 Installing : libtiff-4.0.9-29.el8_8.ppc64le 78/171 Installing : gd-2.2.5-7.el8.ppc64le 79/171 Running scriptlet: gd-2.2.5-7.el8.ppc64le 79/171 Installing : libdatrie-0.2.9-7.el8.ppc64le 80/171 Running scriptlet: libdatrie-0.2.9-7.el8.ppc64le 80/171 Installing : libthai-0.1.27-2.el8.ppc64le 81/171 Running scriptlet: libthai-0.1.27-2.el8.ppc64le 81/171 
Installing : lcms2-2.9-2.el8.ppc64le 82/171 Running scriptlet: lcms2-2.9-2.el8.ppc64le 82/171 Installing : graphite2-1.3.10-10.el8.ppc64le 83/171 Installing : harfbuzz-1.7.5-3.el8.ppc64le 84/171 Running scriptlet: harfbuzz-1.7.5-3.el8.ppc64le 84/171 Installing : pango-1.42.4-8.el8.ppc64le 85/171 Running scriptlet: pango-1.42.4-8.el8.ppc64le 85/171 Installing : platform-python-pip-9.0.3-23.el8_9.1.noarch 86/171 Installing : less-530-2.el8_9.ppc64le 87/171 Running scriptlet: openssh-8.0p1-19.el8_9.2.ppc64le 88/171 Installing : openssh-8.0p1-19.el8_9.2.ppc64le 88/171 Installing : openssl-1:1.1.1k-12.el8_9.ppc64le 89/171 Installing : dbus-libs-1:1.12.8-26.el8.ppc64le 90/171 Running scriptlet: dbus-libs-1:1.12.8-26.el8.ppc64le 90/171 Installing : avahi-libs-0.7-21.el8_9.1.ppc64le 91/171 Installing : cups-libs-1:2.2.6-54.el8_9.ppc64le 92/171 Installing : libgs-9.27-11.el8.ppc64le 93/171 Installing : python3-setuptools-39.2.0-7.el8.noarch 94/171 Installing : python3-pip-9.0.3-23.el8_9.1.noarch 95/171 Installing : python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 96/171 Running scriptlet: python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 
96/171 Installing : libcroco-0.6.12-4.el8_2.1.ppc64le 97/171 Running scriptlet: libcroco-0.6.12-4.el8_2.1.ppc64le 97/171 Installing : shared-mime-info-1.9-3.el8.ppc64le 98/171 Running scriptlet: shared-mime-info-1.9-3.el8.ppc64le 98/171 Installing : gdk-pixbuf2-2.36.12-5.el8.ppc64le 99/171 Running scriptlet: gdk-pixbuf2-2.36.12-5.el8.ppc64le 99/171 Installing : gdk-pixbuf2-modules-2.36.12-5.el8.ppc64le 100/171 Installing : gtk-update-icon-cache-3.22.30-11.el8.ppc64le 101/171 Installing : gtk2-2.24.32-5.el8.ppc64le 102/171 Running scriptlet: gtk2-2.24.32-5.el8.ppc64le 102/171 Installing : librsvg2-2.42.7-5.el8.ppc64le 103/171 Installing : libedit-3.1-23.20170329cvs.el8.ppc64le 104/171 Installing : openssh-clients-8.0p1-19.el8_9.2.ppc64le 105/171 Installing : git-core-2.39.3-1.el8_8.ppc64le 106/171 Installing : git-core-doc-2.39.3-1.el8_8.noarch 107/171 Installing : groff-base-1.22.3-18.el8.ppc64le 108/171 Installing : perl-Digest-1.17-395.el8.noarch 109/171 Installing : perl-Digest-MD5-2.55-396.el8.ppc64le 110/171 Installing : perl-Data-Dumper-2.167-399.el8.ppc64le 111/171 Installing : perl-libnet-3.11-3.el8.noarch 112/171 Installing : perl-URI-1.73-3.el8.noarch 113/171 Installing : perl-Pod-Escapes-1:1.07-395.el8.noarch 114/171 Installing : perl-Time-Local-1:1.280-1.el8.noarch 115/171 Installing : perl-IO-Socket-IP-0.39-5.el8.noarch 116/171 Installing : perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9 117/171 Installing : perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f089 118/171 Installing : perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+5 119/171 Installing : perl-Term-ANSIColor-4.06-396.el8.noarch 120/171 Installing : perl-Term-Cap-1.17-395.el8.noarch 121/171 Installing : perl-File-Temp-0.230.600-1.el8.noarch 122/171 Installing : perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 123/171 Installing : perl-Pod-Simple-1:3.35-395.el8.noarch 124/171 Installing : perl-podlators-4.11-1.el8.noarch 125/171 Installing : perl-Pod-Perldoc-3.28-396.el8.noarch 126/171 Installing : 
perl-Text-ParseWords-3.30-395.el8.noarch 127/171 Installing : perl-Pod-Usage-4:1.69-395.el8.noarch 128/171 Installing : perl-MIME-Base64-3.15-396.el8.ppc64le 129/171 Installing : perl-Storable-1:3.11-3.el8.ppc64le 130/171 Installing : perl-Getopt-Long-1:2.50-4.el8.noarch 131/171 Installing : perl-Socket-4:2.027-3.el8.ppc64le 132/171 Installing : perl-Errno-1.28-422.el8.ppc64le 133/171 Installing : perl-Encode-4:2.97-3.el8.ppc64le 134/171 Installing : perl-Scalar-List-Utils-3:1.49-2.el8.ppc64le 135/171 Installing : perl-Carp-1.42-396.el8.noarch 136/171 Installing : perl-Exporter-5.72-396.el8.noarch 137/171 Installing : perl-libs-4:5.26.3-422.el8.ppc64le 138/171 Installing : perl-parent-1:0.237-1.el8.noarch 139/171 Installing : perl-macros-4:5.26.3-422.el8.ppc64le 140/171 Installing : perl-Unicode-Normalize-1.25-396.el8.ppc64le 141/171 Installing : perl-threads-shared-1.58-2.el8.ppc64le 142/171 Installing : perl-threads-1:2.21-2.el8.ppc64le 143/171 Installing : perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch 144/171 Installing : perl-File-Path-2.15-2.el8.noarch 145/171 Installing : perl-PathTools-3.74-1.el8.ppc64le 146/171 Installing : perl-constant-1.33-396.el8.noarch 147/171 Installing : perl-IO-1.38-422.el8.ppc64le 148/171 Installing : perl-interpreter-4:5.26.3-422.el8.ppc64le 149/171 Installing : perl-TermReadKey-2.37-7.el8.ppc64le 150/171 Installing : perl-Error-1:0.17025-2.el8.noarch 151/171 Installing : perl-Git-2.39.3-1.el8_8.noarch 152/171 Installing : git-2.39.3-1.el8_8.ppc64le 153/171 Installing : cuda-nvvm-12-4-12.4.131-1.ppc64le 154/171 Installing : cuda-nvrtc-12-4-12.4.127-1.ppc64le 155/171 Running scriptlet: cuda-nvrtc-12-4-12.4.127-1.ppc64le 155/171 Installing : cuda-crt-12-4-12.4.131-1.ppc64le 156/171 Installing : cuda-cccl-12-4-12.4.127-1.ppc64le 157/171 Installing : libcudnn8-8.9.7.29-2.cuda12.3.ppc64le 158/171 Installing : libcudnn8-devel-8.9.7.29-2.cuda12.3.ppc64le 159/171 Running scriptlet: libcudnn8-devel-8.9.7.29-2.cuda12.3.ppc64le 159/171 
Installing : cuda-cudart-devel-12-4-12.4.127-1.ppc64le 160/171 Installing : cuda-nvcc-12-4-12.4.131-1.ppc64le 161/171 Installing : cuda-nvrtc-devel-12-4-12.4.127-1.ppc64le 162/171 Installing : doxygen-1:1.8.14-12.el8.ppc64le 163/171 Installing : graphviz-2.40.1-44.el8.ppc64le 164/171 Running scriptlet: graphviz-2.40.1-44.el8.ppc64le 164/171 Installing : python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 165/171 Running scriptlet: python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 165/171 Installing : libcurand-devel-12-4-10.3.5.147-1.ppc64le 166/171 Installing : libcublas-devel-12-4-12.4.5.8-1.ppc64le 167/171 Installing : python36-rpm-macros-3.6.8-38.module+el8.9.0+2097 168/171 Installing : cuda-nvtx-12-4-12.4.127-1.ppc64le 169/171 Installing : cuda-nvml-devel-12-4-12.4.127-1.ppc64le 170/171 Installing : cuda-driver-devel-12-4-12.4.127-1.ppc64le 171/171 Running scriptlet: cuda-toolkit-12-4-config-common-12.4.127-1.noarc 171/171 Running scriptlet: urw-base35-gothic-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-p052-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-bookman-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-c059-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-d050000l-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 
171/171 Running scriptlet: urw-base35-nimbus-roman-fonts-20170801-10.el8.no 171/171 Running scriptlet: urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 171/171 Running scriptlet: urw-base35-standard-symbols-ps-fonts-20170801-10 171/171 Running scriptlet: urw-base35-z003-fonts-20170801-10.el8.noarch 171/171 Running scriptlet: cuda-driver-devel-12-4-12.4.127-1.ppc64le 171/171 Running scriptlet: fontconfig-2.13.1-4.el8.ppc64le 171/171 Running scriptlet: hicolor-icon-theme-0.17-2.el8.noarch 171/171 Running scriptlet: shared-mime-info-1.9-3.el8.ppc64le 171/171 Running scriptlet: gdk-pixbuf2-2.36.12-5.el8.ppc64le 171/171 Verifying : libcudnn8-8.9.7.29-2.cuda12.3.ppc64le 1/171 Verifying : libcudnn8-devel-8.9.7.29-2.cuda12.3.ppc64le 2/171 Verifying : cuda-toolkit-12-4-config-common-12.4.127-1.noarc 3/171 Verifying : cuda-toolkit-12-config-common-12.4.127-1.noarch 4/171 Verifying : cuda-toolkit-config-common-12.4.127-1.noarch 5/171 Verifying : cuda-cccl-12-4-12.4.127-1.ppc64le 6/171 Verifying : cuda-crt-12-4-12.4.131-1.ppc64le 7/171 Verifying : cuda-cudart-12-4-12.4.127-1.ppc64le 8/171 Verifying : cuda-cudart-devel-12-4-12.4.127-1.ppc64le 9/171 Verifying : cuda-driver-devel-12-4-12.4.127-1.ppc64le 10/171 Verifying : cuda-nvcc-12-4-12.4.131-1.ppc64le 11/171 Verifying : cuda-nvml-devel-12-4-12.4.127-1.ppc64le 12/171 Verifying : cuda-nvrtc-12-4-12.4.127-1.ppc64le 13/171 Verifying : cuda-nvrtc-devel-12-4-12.4.127-1.ppc64le 14/171 Verifying : cuda-nvtx-12-4-12.4.127-1.ppc64le 15/171 Verifying : cuda-nvvm-12-4-12.4.131-1.ppc64le 16/171 Verifying : libcublas-12-4-12.4.5.8-1.ppc64le 17/171 Verifying : libcublas-devel-12-4-12.4.5.8-1.ppc64le 18/171 Verifying : libcurand-12-4-10.3.5.147-1.ppc64le 19/171 Verifying : libcurand-devel-12-4-10.3.5.147-1.ppc64le 20/171 Verifying : groff-base-1.22.3-18.el8.ppc64le 21/171 Verifying : libedit-3.1-23.20170329cvs.el8.ppc64le 22/171 Verifying : libpng-2:1.6.34-5.el8.ppc64le 23/171 Verifying : perl-Data-Dumper-2.167-399.el8.ppc64le 24/171 
Verifying : perl-Encode-4:2.97-3.el8.ppc64le 25/171 Verifying : perl-MIME-Base64-3.15-396.el8.ppc64le 26/171 Verifying : perl-PathTools-3.74-1.el8.ppc64le 27/171 Verifying : perl-Scalar-List-Utils-3:1.49-2.el8.ppc64le 28/171 Verifying : perl-Storable-1:3.11-3.el8.ppc64le 29/171 Verifying : perl-Unicode-Normalize-1.25-396.el8.ppc64le 30/171 Verifying : perl-threads-1:2.21-2.el8.ppc64le 31/171 Verifying : perl-threads-shared-1.58-2.el8.ppc64le 32/171 Verifying : shared-mime-info-1.9-3.el8.ppc64le 33/171 Verifying : fontpackages-filesystem-1.44-22.el8.noarch 34/171 Verifying : perl-Carp-1.42-396.el8.noarch 35/171 Verifying : perl-Exporter-5.72-396.el8.noarch 36/171 Verifying : perl-File-Path-2.15-2.el8.noarch 37/171 Verifying : perl-File-Temp-0.230.600-1.el8.noarch 38/171 Verifying : perl-Getopt-Long-1:2.50-4.el8.noarch 39/171 Verifying : perl-Pod-Escapes-1:1.07-395.el8.noarch 40/171 Verifying : perl-Pod-Perldoc-3.28-396.el8.noarch 41/171 Verifying : perl-Pod-Simple-1:3.35-395.el8.noarch 42/171 Verifying : perl-Pod-Usage-4:1.69-395.el8.noarch 43/171 Verifying : perl-Socket-4:2.027-3.el8.ppc64le 44/171 Verifying : perl-Term-ANSIColor-4.06-396.el8.noarch 45/171 Verifying : perl-Term-Cap-1.17-395.el8.noarch 46/171 Verifying : perl-Text-ParseWords-3.30-395.el8.noarch 47/171 Verifying : perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch 48/171 Verifying : perl-Time-Local-1:1.280-1.el8.noarch 49/171 Verifying : perl-constant-1.33-396.el8.noarch 50/171 Verifying : perl-parent-1:0.237-1.el8.noarch 51/171 Verifying : perl-podlators-4.11-1.el8.noarch 52/171 Verifying : gdk-pixbuf2-2.36.12-5.el8.ppc64le 53/171 Verifying : libcroco-0.6.12-4.el8_2.1.ppc64le 54/171 Verifying : fontconfig-2.13.1-4.el8.ppc64le 55/171 Verifying : freetype-2.9.1-9.el8.ppc64le 56/171 Verifying : perl-IO-1.38-422.el8.ppc64le 57/171 Verifying : perl-interpreter-4:5.26.3-422.el8.ppc64le 58/171 Verifying : perl-libs-4:5.26.3-422.el8.ppc64le 59/171 Verifying : perl-macros-4:5.26.3-422.el8.ppc64le 60/171 Verifying 
: emacs-filesystem-1:26.1-11.el8.noarch 61/171 Verifying : perl-Errno-1.28-422.el8.ppc64le 62/171 Verifying : perl-URI-1.73-3.el8.noarch 63/171 Verifying : python3-setuptools-39.2.0-7.el8.noarch 64/171 Verifying : avahi-libs-0.7-21.el8_9.1.ppc64le 65/171 Verifying : cups-libs-1:2.2.6-54.el8_9.ppc64le 66/171 Verifying : dbus-libs-1:1.12.8-26.el8.ppc64le 67/171 Verifying : openssl-1:1.1.1k-12.el8_9.ppc64le 68/171 Verifying : perl-Digest-1.17-395.el8.noarch 69/171 Verifying : perl-Digest-MD5-2.55-396.el8.ppc64le 70/171 Verifying : perl-IO-Socket-IP-0.39-5.el8.noarch 71/171 Verifying : perl-libnet-3.11-3.el8.noarch 72/171 Verifying : openssh-8.0p1-19.el8_9.2.ppc64le 73/171 Verifying : openssh-clients-8.0p1-19.el8_9.2.ppc64le 74/171 Verifying : less-530-2.el8_9.ppc64le 75/171 Verifying : perl-HTTP-Tiny-0.074-2.el8_9.1.noarch 76/171 Verifying : platform-python-pip-9.0.3-23.el8_9.1.noarch 77/171 Verifying : atk-2.28.1-1.el8.ppc64le 78/171 Verifying : graphite2-1.3.10-10.el8.ppc64le 79/171 Verifying : harfbuzz-1.7.5-3.el8.ppc64le 80/171 Verifying : lcms2-2.9-2.el8.ppc64le 81/171 Verifying : libXcursor-1.1.15-3.el8.ppc64le 82/171 Verifying : libXdamage-1.1.4-14.el8.ppc64le 83/171 Verifying : libXinerama-1.1.4-1.el8.ppc64le 84/171 Verifying : libXxf86misc-1.0.4-1.el8.ppc64le 85/171 Verifying : libdatrie-0.2.9-7.el8.ppc64le 86/171 Verifying : mcpp-2.7.2-20.el8.ppc64le 87/171 Verifying : perl-TermReadKey-2.37-7.el8.ppc64le 88/171 Verifying : jbigkit-libs-2.1-14.el8.ppc64le 89/171 Verifying : libSM-1.2.3-1.el8.ppc64le 90/171 Verifying : libXaw-1.0.13-10.el8.ppc64le 91/171 Verifying : libXcomposite-0.4.4-14.el8.ppc64le 92/171 Verifying : libXfixes-5.0.3-7.el8.ppc64le 93/171 Verifying : libXrender-0.9.10-7.el8.ppc64le 94/171 Verifying : libfontenc-1.1.3-8.el8.ppc64le 95/171 Verifying : libidn-1.34-5.el8.ppc64le 96/171 Verifying : libijs-0.35-5.el8.ppc64le 97/171 Verifying : libmcpp-2.7.2-20.el8.ppc64le 98/171 Verifying : libpaper-1.1.24-22.el8.ppc64le 99/171 Verifying : 
libthai-0.1.27-2.el8.ppc64le 100/171 Verifying : xorg-x11-server-utils-7.7-27.el8.ppc64le 101/171 Verifying : google-droid-sans-fonts-20120715-13.el8.noarch 102/171 Verifying : libXxf86vm-1.1.4-9.el8.ppc64le 103/171 Verifying : urw-base35-fonts-20170801-10.el8.noarch 104/171 Verifying : urw-base35-gothic-fonts-20170801-10.el8.noarch 105/171 Verifying : urw-base35-p052-fonts-20170801-10.el8.noarch 106/171 Verifying : adobe-mappings-cmap-20171205-3.el8.noarch 107/171 Verifying : adobe-mappings-cmap-deprecated-20171205-3.el8.no 108/171 Verifying : adobe-mappings-pdf-20180407-1.el8.noarch 109/171 Verifying : hicolor-icon-theme-0.17-2.el8.noarch 110/171 Verifying : perl-Error-1:0.17025-2.el8.noarch 111/171 Verifying : urw-base35-bookman-fonts-20170801-10.el8.noarch 112/171 Verifying : urw-base35-c059-fonts-20170801-10.el8.noarch 113/171 Verifying : urw-base35-d050000l-fonts-20170801-10.el8.noarch 114/171 Verifying : urw-base35-fonts-common-20170801-10.el8.noarch 115/171 Verifying : urw-base35-nimbus-mono-ps-fonts-20170801-10.el8. 
116/171 Verifying : urw-base35-nimbus-roman-fonts-20170801-10.el8.no 117/171 Verifying : urw-base35-nimbus-sans-fonts-20170801-10.el8.noa 118/171 Verifying : urw-base35-standard-symbols-ps-fonts-20170801-10 119/171 Verifying : urw-base35-z003-fonts-20170801-10.el8.noarch 120/171 Verifying : xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarc 121/171 Verifying : gdk-pixbuf2-modules-2.36.12-5.el8.ppc64le 122/171 Verifying : libXt-1.1.5-12.el8.ppc64le 123/171 Verifying : libICE-1.0.9-15.el8.ppc64le 124/171 Verifying : libxcb-1.13.1-1.el8.ppc64le 125/171 Verifying : perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9 126/171 Verifying : libXft-2.3.3-1.el8.ppc64le 127/171 Verifying : perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+5 128/171 Verifying : gd-2.2.5-7.el8.ppc64le 129/171 Verifying : libXau-1.0.9-3.el8.ppc64le 130/171 Verifying : libXext-1.3.4-1.el8.ppc64le 131/171 Verifying : libXi-1.7.10-1.el8.ppc64le 132/171 Verifying : libXmu-1.1.3-1.el8.ppc64le 133/171 Verifying : libXrandr-1.5.2-1.el8.ppc64le 134/171 Verifying : gtk2-2.24.32-5.el8.ppc64le 135/171 Verifying : jbig2dec-libs-0.16-1.el8.ppc64le 136/171 Verifying : libuv-1:1.41.1-1.el8_4.ppc64le 137/171 Verifying : jasper-libs-2.0.14-5.el8.ppc64le 138/171 Verifying : libjpeg-turbo-1.5.3-12.el8.ppc64le 139/171 Verifying : pango-1.42.4-8.el8.ppc64le 140/171 Verifying : xorg-x11-font-utils-1:7.5-41.el8.ppc64le 141/171 Verifying : perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f089 142/171 Verifying : cairo-1.15.12-6.el8.ppc64le 143/171 Verifying : vim-filesystem-2:8.0.1763-19.el8_6.4.noarch 144/171 Verifying : fribidi-1.0.4-9.el8.ppc64le 145/171 Verifying : gtk-update-icon-cache-3.22.30-11.el8.ppc64le 146/171 Verifying : openjpeg2-2.4.0-5.el8.ppc64le 147/171 Verifying : libXpm-3.5.12-9.el8_7.ppc64le 148/171 Verifying : graphviz-2.40.1-44.el8.ppc64le 149/171 Verifying : git-2.39.3-1.el8_8.ppc64le 150/171 Verifying : git-core-2.39.3-1.el8_8.ppc64le 151/171 Verifying : git-core-doc-2.39.3-1.el8_8.noarch 152/171 Verifying : 
perl-Git-2.39.3-1.el8_8.noarch 153/171 Verifying : python3-rpm-generators-5-8.el8.noarch 154/171 Verifying : libtiff-4.0.9-29.el8_8.ppc64le 155/171 Verifying : libX11-1.6.8-6.el8.ppc64le 156/171 Verifying : libX11-common-1.6.8-6.el8.noarch 157/171 Verifying : libgs-9.27-11.el8.ppc64le 158/171 Verifying : librsvg2-2.42.7-5.el8.ppc64le 159/171 Verifying : libwebp-1.0.0-9.el8_9.1.ppc64le 160/171 Verifying : cmake-3.26.5-1.el8_9.ppc64le 161/171 Verifying : cmake-data-3.26.5-1.el8_9.noarch 162/171 Verifying : cmake-filesystem-3.26.5-1.el8_9.ppc64le 163/171 Verifying : cmake-rpm-macros-3.26.5-1.el8_9.noarch 164/171 Verifying : pixman-0.38.4-3.el8_9.ppc64le 165/171 Verifying : platform-python-devel-3.6.8-56.el8_9.3.ppc64le 166/171 Verifying : python36-3.6.8-38.module+el8.9.0+20976+d3c38525. 167/171 Verifying : python36-devel-3.6.8-38.module+el8.9.0+20976+d3c 168/171 Verifying : python36-rpm-macros-3.6.8-38.module+el8.9.0+2097 169/171 Verifying : python3-pip-9.0.3-23.el8_9.1.noarch 170/171 Verifying : doxygen-1:1.8.14-12.el8.ppc64le 171/171 Installed products updated. 
Installed: adobe-mappings-cmap-20171205-3.el8.noarch adobe-mappings-cmap-deprecated-20171205-3.el8.noarch adobe-mappings-pdf-20180407-1.el8.noarch atk-2.28.1-1.el8.ppc64le avahi-libs-0.7-21.el8_9.1.ppc64le cairo-1.15.12-6.el8.ppc64le cmake-3.26.5-1.el8_9.ppc64le cmake-data-3.26.5-1.el8_9.noarch cmake-filesystem-3.26.5-1.el8_9.ppc64le cmake-rpm-macros-3.26.5-1.el8_9.noarch cuda-cccl-12-4-12.4.127-1.ppc64le cuda-crt-12-4-12.4.131-1.ppc64le cuda-cudart-12-4-12.4.127-1.ppc64le cuda-cudart-devel-12-4-12.4.127-1.ppc64le cuda-driver-devel-12-4-12.4.127-1.ppc64le cuda-nvcc-12-4-12.4.131-1.ppc64le cuda-nvml-devel-12-4-12.4.127-1.ppc64le cuda-nvrtc-12-4-12.4.127-1.ppc64le cuda-nvrtc-devel-12-4-12.4.127-1.ppc64le cuda-nvtx-12-4-12.4.127-1.ppc64le cuda-nvvm-12-4-12.4.131-1.ppc64le cuda-toolkit-12-4-config-common-12.4.127-1.noarch cuda-toolkit-12-config-common-12.4.127-1.noarch cuda-toolkit-config-common-12.4.127-1.noarch cups-libs-1:2.2.6-54.el8_9.ppc64le dbus-libs-1:1.12.8-26.el8.ppc64le doxygen-1:1.8.14-12.el8.ppc64le emacs-filesystem-1:26.1-11.el8.noarch fontconfig-2.13.1-4.el8.ppc64le fontpackages-filesystem-1.44-22.el8.noarch freetype-2.9.1-9.el8.ppc64le fribidi-1.0.4-9.el8.ppc64le gd-2.2.5-7.el8.ppc64le gdk-pixbuf2-2.36.12-5.el8.ppc64le gdk-pixbuf2-modules-2.36.12-5.el8.ppc64le git-2.39.3-1.el8_8.ppc64le git-core-2.39.3-1.el8_8.ppc64le git-core-doc-2.39.3-1.el8_8.noarch google-droid-sans-fonts-20120715-13.el8.noarch graphite2-1.3.10-10.el8.ppc64le graphviz-2.40.1-44.el8.ppc64le groff-base-1.22.3-18.el8.ppc64le gtk-update-icon-cache-3.22.30-11.el8.ppc64le gtk2-2.24.32-5.el8.ppc64le harfbuzz-1.7.5-3.el8.ppc64le hicolor-icon-theme-0.17-2.el8.noarch jasper-libs-2.0.14-5.el8.ppc64le jbig2dec-libs-0.16-1.el8.ppc64le jbigkit-libs-2.1-14.el8.ppc64le lcms2-2.9-2.el8.ppc64le less-530-2.el8_9.ppc64le libICE-1.0.9-15.el8.ppc64le libSM-1.2.3-1.el8.ppc64le libX11-1.6.8-6.el8.ppc64le libX11-common-1.6.8-6.el8.noarch libXau-1.0.9-3.el8.ppc64le libXaw-1.0.13-10.el8.ppc64le 
libXcomposite-0.4.4-14.el8.ppc64le libXcursor-1.1.15-3.el8.ppc64le libXdamage-1.1.4-14.el8.ppc64le libXext-1.3.4-1.el8.ppc64le libXfixes-5.0.3-7.el8.ppc64le libXft-2.3.3-1.el8.ppc64le libXi-1.7.10-1.el8.ppc64le libXinerama-1.1.4-1.el8.ppc64le libXmu-1.1.3-1.el8.ppc64le libXpm-3.5.12-9.el8_7.ppc64le libXrandr-1.5.2-1.el8.ppc64le libXrender-0.9.10-7.el8.ppc64le libXt-1.1.5-12.el8.ppc64le libXxf86misc-1.0.4-1.el8.ppc64le libXxf86vm-1.1.4-9.el8.ppc64le libcroco-0.6.12-4.el8_2.1.ppc64le libcublas-12-4-12.4.5.8-1.ppc64le libcublas-devel-12-4-12.4.5.8-1.ppc64le libcudnn8-8.9.7.29-2.cuda12.3.ppc64le libcudnn8-devel-8.9.7.29-2.cuda12.3.ppc64le libcurand-12-4-10.3.5.147-1.ppc64le libcurand-devel-12-4-10.3.5.147-1.ppc64le libdatrie-0.2.9-7.el8.ppc64le libedit-3.1-23.20170329cvs.el8.ppc64le libfontenc-1.1.3-8.el8.ppc64le libgs-9.27-11.el8.ppc64le libidn-1.34-5.el8.ppc64le libijs-0.35-5.el8.ppc64le libjpeg-turbo-1.5.3-12.el8.ppc64le libmcpp-2.7.2-20.el8.ppc64le libpaper-1.1.24-22.el8.ppc64le libpng-2:1.6.34-5.el8.ppc64le librsvg2-2.42.7-5.el8.ppc64le libthai-0.1.27-2.el8.ppc64le libtiff-4.0.9-29.el8_8.ppc64le libuv-1:1.41.1-1.el8_4.ppc64le libwebp-1.0.0-9.el8_9.1.ppc64le libxcb-1.13.1-1.el8.ppc64le mcpp-2.7.2-20.el8.ppc64le openjpeg2-2.4.0-5.el8.ppc64le openssh-8.0p1-19.el8_9.2.ppc64le openssh-clients-8.0p1-19.el8_9.2.ppc64le openssl-1:1.1.1k-12.el8_9.ppc64le pango-1.42.4-8.el8.ppc64le perl-Carp-1.42-396.el8.noarch perl-Data-Dumper-2.167-399.el8.ppc64le perl-Digest-1.17-395.el8.noarch perl-Digest-MD5-2.55-396.el8.ppc64le perl-Encode-4:2.97-3.el8.ppc64le perl-Errno-1.28-422.el8.ppc64le perl-Error-1:0.17025-2.el8.noarch perl-Exporter-5.72-396.el8.noarch perl-File-Path-2.15-2.el8.noarch perl-File-Temp-0.230.600-1.el8.noarch perl-Getopt-Long-1:2.50-4.el8.noarch perl-Git-2.39.3-1.el8_8.noarch perl-HTTP-Tiny-0.074-2.el8_9.1.noarch perl-IO-1.38-422.el8.ppc64le perl-IO-Socket-IP-0.39-5.el8.noarch perl-IO-Socket-SSL-2.066-4.module+el8.3.0+6446+594cad75.noarch 
perl-MIME-Base64-3.15-396.el8.ppc64le perl-Mozilla-CA-20160104-7.module+el8.3.0+6498+9eecfe51.noarch perl-Net-SSLeay-1.88-2.module+el8.6.0+13392+f0897f98.ppc64le perl-PathTools-3.74-1.el8.ppc64le perl-Pod-Escapes-1:1.07-395.el8.noarch perl-Pod-Perldoc-3.28-396.el8.noarch perl-Pod-Simple-1:3.35-395.el8.noarch perl-Pod-Usage-4:1.69-395.el8.noarch perl-Scalar-List-Utils-3:1.49-2.el8.ppc64le perl-Socket-4:2.027-3.el8.ppc64le perl-Storable-1:3.11-3.el8.ppc64le perl-Term-ANSIColor-4.06-396.el8.noarch perl-Term-Cap-1.17-395.el8.noarch perl-TermReadKey-2.37-7.el8.ppc64le perl-Text-ParseWords-3.30-395.el8.noarch perl-Text-Tabs+Wrap-2013.0523-395.el8.noarch perl-Time-Local-1:1.280-1.el8.noarch perl-URI-1.73-3.el8.noarch perl-Unicode-Normalize-1.25-396.el8.ppc64le perl-constant-1.33-396.el8.noarch perl-interpreter-4:5.26.3-422.el8.ppc64le perl-libnet-3.11-3.el8.noarch perl-libs-4:5.26.3-422.el8.ppc64le perl-macros-4:5.26.3-422.el8.ppc64le perl-parent-1:0.237-1.el8.noarch perl-podlators-4.11-1.el8.noarch perl-threads-1:2.21-2.el8.ppc64le perl-threads-shared-1.58-2.el8.ppc64le pixman-0.38.4-3.el8_9.ppc64le platform-python-devel-3.6.8-56.el8_9.3.ppc64le platform-python-pip-9.0.3-23.el8_9.1.noarch python3-pip-9.0.3-23.el8_9.1.noarch python3-rpm-generators-5-8.el8.noarch python3-setuptools-39.2.0-7.el8.noarch python36-3.6.8-38.module+el8.9.0+20976+d3c38525.ppc64le python36-devel-3.6.8-38.module+el8.9.0+20976+d3c38525.ppc64le python36-rpm-macros-3.6.8-38.module+el8.9.0+20976+d3c38525.noarch shared-mime-info-1.9-3.el8.ppc64le urw-base35-bookman-fonts-20170801-10.el8.noarch urw-base35-c059-fonts-20170801-10.el8.noarch urw-base35-d050000l-fonts-20170801-10.el8.noarch urw-base35-fonts-20170801-10.el8.noarch urw-base35-fonts-common-20170801-10.el8.noarch urw-base35-gothic-fonts-20170801-10.el8.noarch urw-base35-nimbus-mono-ps-fonts-20170801-10.el8.noarch urw-base35-nimbus-roman-fonts-20170801-10.el8.noarch urw-base35-nimbus-sans-fonts-20170801-10.el8.noarch 
urw-base35-p052-fonts-20170801-10.el8.noarch urw-base35-standard-symbols-ps-fonts-20170801-10.el8.noarch urw-base35-z003-fonts-20170801-10.el8.noarch vim-filesystem-2:8.0.1763-19.el8_6.4.noarch xorg-x11-font-utils-1:7.5-41.el8.ppc64le xorg-x11-fonts-ISO8859-1-100dpi-7.5-19.el8.noarch xorg-x11-server-utils-7.7-27.el8.ppc64le Complete! Finish: build setup for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm Start: rpmbuild cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm sh: -c: line 0: unexpected EOF while looking for matching `"' sh: -c: line 1: syntax error: unexpected end of file Building target platforms: ppc64le Building for target ppc64le Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.uVycJd + umask 022 + cd /builddir/build/BUILD + cd /builddir/build/BUILD + rm -rf cutlass + /usr/bin/mkdir -p cutlass + cd cutlass + /usr/bin/chmod -Rf a+rX,u+w,g-w,o-w . + git clone --depth 1 -n -b v3.5.0 https://github.com/NVIDIA/cutlass.git . Cloning into '.'... + git reset --hard v3.5.0 HEAD is now at 7d49e6c Updates for CUTLASS 3.5.0 (#1468) + git log --format=fuller commit 7d49e6c7e2f8896c47f586706e67e1fb215529dc Author: Vijay Thakkar AuthorDate: Thu Apr 11 21:33:40 2024 -0400 Commit: GitHub CommitDate: Thu Apr 11 21:33:40 2024 -0400 Updates for CUTLASS 3.5.0 (#1468) Patch #0 (cutlass-fp16.patch): + echo 'Patch #0 (cutlass-fp16.patch):' + /usr/bin/patch --no-backup-if-mismatch -p0 -b --suffix .fp16~ --fuzz=100 patching file include/cutlass/functional.h Hunk #1 succeeded at 217 with fuzz 3 (offset 128 lines). 
+ sed -i /-rpath/d CMakeLists.txt + exit 0 Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.oWcJ69 + umask 022 + cd /builddir/build/BUILD + cd cutlass + mkdir -p build ~/build/BUILD/cutlass/build ~/build/BUILD/cutlass + pushd build + export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64/ + LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64/ + CFLAGS= + export CFLAGS + CXXFLAGS= + export CXXFLAGS + FFLAGS=' -I/usr/lib64/gfortran/modules' + export FFLAGS + FCFLAGS=' -I/usr/lib64/gfortran/modules' + export FCFLAGS + LDFLAGS='-Wl,-z,relro ' + export LDFLAGS + /usr/bin/cmake -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON .. -DCMAKE_SKIP_RPATH=ON -DCMAKE_VERBOSE_MAKEFILE=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_EXE_LINKER_FLAGS=/usr/lib64/libstdc++.so.6 -DBUILD_TESTING=OFF -DCUTLASS_ENABLE_TESTS=OFF -DCUTLASS_ENABLE_PROFILER=ON -DCUTLASS_ENABLE_EXAMPLES=OFF -DCUDA_PROPAGATE_HOST_FLAGS=OFF -DCUTLASS_NVCC_EMBED_PTX=ON -DCUTLASS_NVCC_EMBED_CUBIN=ON '-DCUTLASS_NVCC_ARCHS=52;61;75;86;89;90' '-DCMAKE_CUDA_FLAGS=-Wl,--no-relax -Xfatbin=-compress-all --compiler-options -fPIC -Wno-deprecated-gpu-targets -allow-unsupported-compiler -D_SERIALIZE_H_INCLUDED' -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc -- CMake Version: 3.26.5 -- CUTLASS 3.5.0 -- The CXX compiler identification is GNU 8.5.0 -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- The CUDA compiler identification is NVIDIA 12.4.131 -- Detecting CUDA compiler ABI info -- Detecting CUDA compiler ABI info - done -- Check for working CUDA 
compiler: /usr/local/cuda-12.4/bin/nvcc - skipped -- Detecting CUDA compile features -- Detecting CUDA compile features - done -- CUDART: /usr/local/cuda-12.4/lib64/libcudart.so -- CUDA Driver: /usr/local/cuda-12.4/lib64/stubs/libcuda.so -- NVRTC: /usr/local/cuda-12.4/lib64/libnvrtc.so -- Default Install Location: /usr -- Found Python3: /usr/bin/python3.6 (found suitable version "3.6.8", minimum required is "3.5") found components: Interpreter CMake Warning at CMakeLists.txt:156 (message): Using unsupported or deprecated compute capabilities 52;61. Support may be removed in future versions. -- CUDA Compilation Architectures: 52;61;75;86;89;90 -- Enable caching of reference results in conv unit tests -- Enable rigorous conv problem sizes in conv unit tests -- Using NVCC flags: --expt-relaxed-constexpr;-DCUTLASS_TEST_LEVEL=0;-DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1;-DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1;-DCUTLASS_DEBUG_TRACE_LEVEL=0;-Xcompiler=-Wconversion;-Xcompiler=-fno-strict-aliasing -- CUTLASS Revision: 7d49e6c -- Configuring cublas ... -- cuBLAS Disabled. -- Configuring cuBLAS ... done. -- Completed generation of library instances. See /builddir/build/BUILD/cutlass/build/tools/library/library_instance_generation.log for more information. 
-- Configuring done (7.3s) -- Generating done (3.1s) CMake Warning: Manually-specified variables were not used by the project: CMAKE_C_FLAGS_RELEASE CMAKE_Fortran_FLAGS_RELEASE CUDA_PROPAGATE_HOST_FLAGS INCLUDE_INSTALL_DIR LIB_INSTALL_DIR LIB_SUFFIX SHARE_INSTALL_PREFIX SYSCONF_INSTALL_DIR -- Build files have been written to: /builddir/build/BUILD/cutlass/build + make -j2 [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/all_sm90_z1684symm_symm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/handle.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 0%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/src/manifest.cpp.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/operation_table.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/singleton.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/util.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int4.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684symm_objs.dir/generated/symm/90/z1684symm/cutlass_tensorop_z1684symm_128x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 0%] Built target cutlass_library_symm_sm90_z1684symm_objs [ 0%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/all_sm50_cgemm_gemm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_nn_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_nt_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_canonical.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_tn_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_cgemm_objs.dir/generated/gemm/50/cgemm/cutlass_simt_cgemm_128x64_8x2_tt_align1.cu.o [ 0%] Built target cutlass_library_gemm_sm50_cgemm_objs [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/all_sm50_dgemm_gemm_operations.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_nn_align1.cu.o [ 0%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_dgemm_objs.dir/generated/gemm/50/dgemm/cutlass_simt_dgemm_128x128_8x2_tt_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm50_dgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/all_sm50_sgemm_gemm_operations.cu.o [ 1%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm50_sgemm_objs.dir/generated/gemm/50/sgemm/cutlass_simt_sgemm_128x128_8x2_tt_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm50_sgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/all_sm60_hgemm_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_32.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_int8_interleaved_64.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm60_hgemm_objs.dir/generated/gemm/60/hgemm/cutlass_simt_hgemm_256x128_8x2_tt_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm60_hgemm_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/all_sm61_igemm_s8_gemm_operations.cu.o [ 1%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e4m3out.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_igemm_s8_objs.dir/generated/gemm/61/igemm_s8/cutlass_simt_igemm_s8_128x128_32x2_tt_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm61_igemm_s8_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/all_sm61_s8_igemm_s8_gemm_operations.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_nn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_nt_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_tn_align1.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm61_s8_igemm_s8_objs.dir/generated/gemm/61/s8_igemm_s8/cutlass_simt_s8_igemm_s8_128x128_32x2_tt_align1.cu.o [ 1%] Built target cutlass_library_gemm_sm61_s8_igemm_s8_objs [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/all_sm70_f16_s884gemm_f16_gemm_operations.cu.o [ 1%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nn_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_nt_align8.cu.o [ 1%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_f16_objs.dir/generated/gemm/70/f16_s884gemm_f16/cutlass_tensorop_f16_s884gemm_f16_256x128_32x2_tt_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16_objs [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/all_sm70_f16_s884gemm_planar_complex_array_f16_gemm_operations.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_cn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_cc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ct_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_nh_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ch_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hn_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_tt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_ht_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_th_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_array_f16/cutlass_tensorop_f16_s884gemm_planar_complex_array_f16_64x64_32x2_hh_align8.cu.o [ 2%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_objs [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/all_sm70_f16_s884gemm_planar_complex_f16_gemm_operations.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_cn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_cc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ct_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_nh_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ch_align8.cu.o [ 2%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hn_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hc_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_tt_align8.cu.o [ 2%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e4m3out.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_ht_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_th_align8.cu.o [ 3%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/f16_s884gemm_planar_complex_f16/cutlass_tensorop_f16_s884gemm_planar_complex_f16_64x64_32x2_hh_align8.cu.o [ 3%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_objs [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/all_sm70_h884gemm_gemm_operations.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_nn_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_nt_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_tn_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_objs.dir/generated/gemm/70/h884gemm/cutlass_tensorop_h884gemm_256x128_32x2_tt_align8.cu.o [ 3%] Built target cutlass_library_gemm_sm70_h884gemm_objs [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/all_sm70_h884gemm_planar_complex_gemm_operations.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nn_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_cn_align8.cu.o [ 3%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nc_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_cc_align8.cu.o [ 3%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_objs.dir/generated/gemm/70/h884gemm_planar_complex/cutlass_tensorop_h884gemm_planar_complex_64x64_32x2_hh_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/all_sm70_h884gemm_planar_complex_array_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_cn_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e4m3a_e5m2out.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_ht_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_th_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs.dir/generated/gemm/70/h884gemm_planar_complex_array/cutlass_tensorop_h884gemm_planar_complex_array_64x64_32x2_hh_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/all_sm70_s884gemm_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_nn_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_f16_objs.dir/generated/gemm/70/s884gemm_f16/cutlass_tensorop_s884gemm_f16_256x128_32x2_tt_align8.cu.o [ 4%] Built target cutlass_library_gemm_sm70_s884gemm_f16_objs [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/all_sm70_s884gemm_planar_complex_array_f16_gemm_operations.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_cn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_cc_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ct_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_nh_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ch_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hn_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tc_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hc_align8.cu.o [ 4%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_tt_align8.cu.o [ 4%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_ht_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_th_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_array_f16/cutlass_tensorop_s884gemm_planar_complex_array_f16_64x64_32x2_hh_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/all_sm70_s884gemm_planar_complex_f16_gemm_operations.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_cn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nc_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_cc_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nt_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ct_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_nh_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_e5m2a_e5m2out.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ch_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tc_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hc_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_tt_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_ht_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_th_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs.dir/generated/gemm/70/s884gemm_planar_complex_f16/cutlass_tensorop_s884gemm_planar_complex_f16_64x64_32x2_hh_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/all_sm75_f16_s1688gemm_f16_gemm_operations.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_nn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_nt_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs.dir/generated/gemm/75/f16_s1688gemm_f16/cutlass_tensorop_f16_s1688gemm_f16_256x128_32x2_tt_align8.cu.o [ 5%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16_objs [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/all_sm75_f16_s1688gemm_planar_complex_array_f16_gemm_operations.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_cn_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nc_align8.cu.o [ 5%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_cc_align8.cu.o [ 5%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ct_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_nh_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ch_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tn_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hn_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tc_align8.cu.o [ 6%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hc_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_tt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_ht_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_th_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_array_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_array_f16_64x128_32x2_hh_align8.cu.o [ 6%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_objs [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/all_sm75_f16_s1688gemm_planar_complex_f16_gemm_operations.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nn_align8.cu.o [ 6%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_cn_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nc_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp16out.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_cc_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ct_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_nh_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ch_align8.cu.o [ 6%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tn_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hn_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tc_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hc_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_tt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_ht_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_th_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/f16_s1688gemm_planar_complex_f16/cutlass_tensorop_f16_s1688gemm_planar_complex_f16_64x128_32x2_hh_align8.cu.o [ 6%] Built target 
cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_objs [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/all_sm75_h1688gemm_gemm_operations.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_nn_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_nt_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_tn_align8.cu.o [ 6%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_bf16out.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_objs.dir/generated/gemm/75/h1688gemm/cutlass_tensorop_h1688gemm_256x128_32x2_tt_align8.cu.o [ 7%] Built target cutlass_library_gemm_sm75_h1688gemm_objs [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/all_sm75_h1688gemm_planar_complex_gemm_operations.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_cn_align8.cu.o [ 7%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_cc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ct_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_nh_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ch_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hn_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tc_align8.cu.o [ 7%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hc_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_tt_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_ht_align8.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp8in_fp32out.cu.o [ 7%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_th_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs.dir/generated/gemm/75/h1688gemm_planar_complex/cutlass_tensorop_h1688gemm_planar_complex_64x128_32x2_hh_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/all_sm75_h1688gemm_planar_complex_array_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nn_align8.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_cn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_cc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ct_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_nh_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ch_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tn_align8.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp32out.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_tt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_ht_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_th_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs.dir/generated/gemm/75/h1688gemm_planar_complex_array/cutlass_tensorop_h1688gemm_planar_complex_array_64x128_32x2_hh_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_objs [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_i88128xorgemm_b1_objs.dir/generated/gemm/75/i88128xorgemm_b1/all_sm75_i88128xorgemm_b1_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i88128xorgemm_b1_objs.dir/generated/gemm/75/i88128xorgemm_b1/cutlass_tensorop_i88128xorgemm_b1_256x128_512x2_tn_align128.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_s8_objs.dir/generated/gemm/75/i8816gemm_s8/all_sm75_i8816gemm_s8_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_s8_objs.dir/generated/gemm/75/i8816gemm_s8/cutlass_tensorop_i8816gemm_s8_256x128_64x2_tn_align16.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8816gemm_s8_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_u8_objs.dir/generated/gemm/75/i8816gemm_u8/all_sm75_i8816gemm_u8_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8816gemm_u8_objs.dir/generated/gemm/75/i8816gemm_u8/cutlass_tensorop_i8816gemm_u8_256x128_64x2_tn_align16.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8816gemm_u8_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_s4_objs.dir/generated/gemm/75/i8832gemm_s4/all_sm75_i8832gemm_s4_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_s4_objs.dir/generated/gemm/75/i8832gemm_s4/cutlass_tensorop_i8832gemm_s4_256x128_128x2_tn_align32.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_other.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8832gemm_s4_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_u4_objs.dir/generated/gemm/75/i8832gemm_u4/all_sm75_i8832gemm_u4_gemm_operations.cu.o [ 
8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_i8832gemm_u4_objs.dir/generated/gemm/75/i8832gemm_u4/cutlass_tensorop_i8832gemm_u4_256x128_128x2_tn_align32.cu.o [ 8%] Built target cutlass_library_gemm_sm75_i8832gemm_u4_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/gemm_fp_mixed_input.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/all_sm75_s1688gemm_f16_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_nn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_tn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_f16_objs.dir/generated/gemm/75/s1688gemm_f16/cutlass_tensorop_s1688gemm_f16_256x128_32x2_tt_align8.cu.o [ 8%] Built target cutlass_library_gemm_sm75_s1688gemm_f16_objs [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/all_sm75_s1688gemm_planar_complex_array_f16_gemm_operations.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nn_align8.cu.o [ 8%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_cn_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nc_align8.cu.o [ 8%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_cc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ct_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_nh_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ch_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/initialize_reference_operations.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/reduction_device.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reduction/init_reduction_operations.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hn_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv2d.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hc_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/src/reference/conv3d.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_tt_align8.cu.o [ 9%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_ht_align8.cu.o [ 9%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_th_align8.cu.o [ 9%] Building CXX object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/initialize_all.cpp.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/gemm/all_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_array_f16/cutlass_tensorop_s1688gemm_planar_complex_array_f16_64x128_32x2_hh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv2d/all_conv2d_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/conv3d/all_conv3d_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/rank_k/all_rank_k_operations.cu.o [ 10%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/all_sm75_s1688gemm_planar_complex_f16_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/rank_2k/all_rank_2k_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/trmm/all_trmm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_objs.dir/generated/symm/all_symm_operations.cu.o [ 10%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_cn_align8.cu.o [ 10%] Built target cutlass_library_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/all_sm75_s4_i8832gemm_s4_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/cutlass_tensorop_s4_i8832gemm_s4_256x128_128x2_tn_align32.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs.dir/generated/gemm/75/s4_i8832gemm_s4/cutlass_tensorop_s4_i8832gemm_s4_256x128_128x2_n64t64_align32.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_cc_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ct_align8.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_nh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ch_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_tt_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_ht_align8.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_th_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs.dir/generated/gemm/75/s1688gemm_planar_complex_f16/cutlass_tensorop_s1688gemm_planar_complex_f16_64x128_32x2_hh_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/all_sm75_s8_i8816gemm_s8_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_tn_align16.cu.o [ 10%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/all_sm75_u4_i8832gemm_u4_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/cutlass_tensorop_u4_i8832gemm_u4_256x128_128x2_tn_align32.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs.dir/generated/gemm/75/s8_i8816gemm_s8/cutlass_tensorop_s8_i8816gemm_s8_256x128_64x2_n32t32_align16.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs.dir/generated/gemm/75/u4_i8832gemm_u4/cutlass_tensorop_u4_i8832gemm_u4_256x128_128x2_n64t64_align32.cu.o [ 10%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/all_sm75_u8_i8816gemm_u8_gemm_operations.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/cutlass_tensorop_u8_i8816gemm_u8_256x128_64x2_tn_align16.cu.o [ 10%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/all_sm80_bf16_s16816gemm_bf16_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_nn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs.dir/generated/gemm/75/u8_i8816gemm_u8/cutlass_tensorop_u8_i8816gemm_u8_256x128_64x2_n32t32_align16.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_nt_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_s8/all_sm80_bf16_s16816gemm_bf16_s8_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_s8/cutlass_tensorop_bf16_s16816gemm_bf16_s8_128x128_64x4_tn_align16.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_tn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16/cutlass_tensorop_bf16_s16816gemm_bf16_256x128_32x3_tt_align8.cu.o [ 
10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_u8/all_sm80_bf16_s16816gemm_bf16_u8_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/bf16_s16816gemm_bf16_u8/cutlass_tensorop_bf16_s16816gemm_bf16_u8_128x128_64x4_tn_align16.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/all_sm80_bf16_s16816gemm_planar_complex_array_bf16_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nn_align8.cu.o [ 10%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_objs [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/all_sm80_bf16_s16816gemm_planar_complex_bf16_gemm_operations.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_cn_align8.cu.o [ 10%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_cn_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nc_align8.cu.o [ 10%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nc_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_cc_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_cc_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nt_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nt_align8.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ct_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ct_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_nh_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_nh_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ch_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ch_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tn_align8.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tn_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hn_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hn_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tc_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tc_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hc_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hc_align8.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_tt_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_tt_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_ht_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_ht_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_th_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_th_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_bf16_64x128_32x3_hh_align8.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_planar_complex_array_bf16/cutlass_tensorop_bf16_s16816gemm_planar_complex_array_bf16_64x128_32x3_hh_align8.cu.o [ 11%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_s8_bf16/all_sm80_bf16_s16816gemm_s8_bf16_gemm_operations.cu.o [ 11%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_u8_bf16/all_sm80_bf16_s16816gemm_u8_bf16_gemm_operations.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_s8_bf16/cutlass_tensorop_bf16_s16816gemm_s8_bf16_128x128_64x4_tn_align16.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/bf16_s16816gemm_u8_bf16/cutlass_tensorop_bf16_s16816gemm_u8_bf16_128x128_64x4_tn_align16.cu.o [ 11%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/all_sm80_bf16_s16832spgemm_bf16_gemm_operations.cu.o [ 11%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/all_sm80_c1688gemm_gemm_operations.cu.o [ 11%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_nn_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_nt_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_cn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tn_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs.dir/generated/gemm/80/bf16_s16832spgemm_bf16/cutlass_tensorop_bf16_s16832spgemm_bf16_64x128_64x6_tt_align8.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_cc_align1.cu.o [ 11%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_objs [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/all_sm80_c1688tf32gemm_gemm_operations.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nt_align1.cu.o 
[ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ct_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_cn_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_nh_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nc_align1.cu.o [ 11%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ch_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_cc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hn_align1.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ct_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_nh_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ch_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_tt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_ht_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_th_align1.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tc_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688gemm_objs.dir/generated/gemm/80/c1688gemm/cutlass_tensorop_c1688gemm_128x64_16x3_hh_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hc_align1.cu.o [ 12%] Built target cutlass_library_gemm_sm80_c1688gemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/all_sm80_cgemm_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_tt_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_ht_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_cn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_th_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nc_align1.cu.o [ 12%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_c1688tf32gemm_objs.dir/generated/gemm/80/c1688tf32gemm/cutlass_tensorop_c1688tf32gemm_128x128_16x4_hh_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_cc_align1.cu.o [ 12%] Built target cutlass_library_gemm_sm80_c1688tf32gemm_objs [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/all_sm80_d884gemm_gemm_operations.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_nn_align1.cu.o [ 12%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nt_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_nt_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_tn_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ct_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_d884gemm_objs.dir/generated/gemm/80/d884gemm/cutlass_tensorop_d884gemm_128x128_16x3_tt_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_nh_align1.cu.o [ 13%] Built target cutlass_library_gemm_sm80_d884gemm_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/all_sm80_dgemm_gemm_operations.cu.o [ 13%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_nn_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ch_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_nt_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tn_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_tn_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hn_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_dgemm_objs.dir/generated/gemm/80/dgemm/cutlass_simt_dgemm_128x128_8x3_tt_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tc_align1.cu.o [ 13%] Built target cutlass_library_gemm_sm80_dgemm_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/all_sm80_f16_s16816gemm_f16_gemm_operations.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_nn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hc_align1.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_nt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_tt_align1.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs.dir/generated/gemm/80/f16_s16816gemm_f16/cutlass_tensorop_f16_s16816gemm_f16_256x128_32x3_tt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_ht_align1.cu.o [ 13%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_s8/all_sm80_f16_s16816gemm_f16_s8_gemm_operations.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_s8/cutlass_tensorop_f16_s16816gemm_f16_s8_128x128_64x4_tn_align16.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_th_align1.cu.o [ 13%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_u8/all_sm80_f16_s16816gemm_f16_u8_gemm_operations.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs.dir/generated/gemm/80/f16_s16816gemm_f16_u8/cutlass_tensorop_f16_s16816gemm_f16_u8_128x128_64x4_tn_align16.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_cgemm_objs.dir/generated/gemm/80/cgemm/cutlass_simt_cgemm_128x128_8x5_hh_align1.cu.o [ 13%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/all_sm80_f16_s16816gemm_planar_complex_array_f16_gemm_operations.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nn_align8.cu.o [ 13%] Built target cutlass_library_gemm_sm80_cgemm_objs [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/all_sm80_f16_s16816gemm_planar_complex_f16_gemm_operations.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_cn_align8.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_cn_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nc_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nc_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_cc_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_cc_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nt_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nt_align8.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ct_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ct_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_nh_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_nh_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ch_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ch_align8.cu.o [ 13%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tn_align8.cu.o [ 13%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hn_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tc_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tc_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hc_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hc_align8.cu.o [ 14%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_tt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_tt_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_ht_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_ht_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_th_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_th_align8.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_array_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_array_f16_64x128_32x3_hh_align8.cu.o [ 14%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/f16_s16816gemm_planar_complex_f16/cutlass_tensorop_f16_s16816gemm_planar_complex_f16_64x128_32x3_hh_align8.cu.o [ 14%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_objs [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_s8_f16/all_sm80_f16_s16816gemm_s8_f16_gemm_operations.cu.o [ 14%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_objs [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_u8_f16/all_sm80_f16_s16816gemm_u8_f16_gemm_operations.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_s8_f16/cutlass_tensorop_f16_s16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 14%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs.dir/generated/gemm/80/f16_s16816gemm_u8_f16/cutlass_tensorop_f16_s16816gemm_u8_f16_128x128_64x4_tn_align16.cu.o [ 14%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/all_sm80_f16_s16832spgemm_f16_gemm_operations.cu.o [ 15%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/all_sm80_gz884gemm_gemm_operations.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_nn_align8.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nn_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_cn_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_nt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nc_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_tn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_cc_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs.dir/generated/gemm/80/f16_s16832spgemm_f16/cutlass_tensorop_f16_s16832spgemm_f16_64x128_64x6_tt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nt_align1.cu.o [ 15%] Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16_objs [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/all_sm80_h16816gemm_gemm_operations.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ct_align1.cu.o [ 15%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_nn_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_nh_align1.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_nt_align8.cu.o [ 15%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ch_align1.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_tn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tn_align1.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_objs.dir/generated/gemm/80/h16816gemm/cutlass_tensorop_h16816gemm_256x128_32x3_tt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hn_align1.cu.o [ 16%] Built target cutlass_library_gemm_sm80_h16816gemm_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/all_sm80_h16816gemm_grouped_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tc_align1.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hc_align1.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_tt_align1.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_ht_align1.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_grouped_objs.dir/generated/gemm/80/h16816gemm_grouped/cutlass_tensorop_h16816gemm_grouped_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_th_align1.cu.o [ 16%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/all_sm80_h16816gemm_planar_complex_gemm_operations.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_gz884gemm_objs.dir/generated/gemm/80/gz884gemm/cutlass_tensorop_gz884gemm_64x64_8x3_hh_align1.cu.o [ 16%] Built target cutlass_library_gemm_sm80_gz884gemm_objs [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/all_sm80_h16816gemm_planar_complex_array_gemm_operations.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_cn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_cn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_cc_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_cc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ct_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_nh_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ct_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ch_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_nh_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ch_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hn_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hc_align8.cu.o [ 16%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tc_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_tt_align8.cu.o [ 16%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_ht_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_tt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_th_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_ht_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs.dir/generated/gemm/80/h16816gemm_planar_complex/cutlass_tensorop_h16816gemm_planar_complex_64x128_32x3_hh_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_th_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs.dir/generated/gemm/80/h16816gemm_s8_f16/all_sm80_h16816gemm_s8_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs.dir/generated/gemm/80/h16816gemm_s8_f16/cutlass_tensorop_h16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs.dir/generated/gemm/80/h16816gemm_planar_complex_array/cutlass_tensorop_h16816gemm_planar_complex_array_64x128_32x3_hh_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/all_sm80_h16832spgemm_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_nn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168128spgemm_s4_objs.dir/generated/gemm/80/i168128spgemm_s4/all_sm80_i168128spgemm_s4_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168128spgemm_s4_objs.dir/generated/gemm/80/i168128spgemm_s4/cutlass_tensorop_i168128spgemm_s4_64x64_256x4_tn_align32.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_nt_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256andgemm_b1_objs.dir/generated/gemm/80/i168256andgemm_b1/all_sm80_i168256andgemm_b1_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256andgemm_b1_objs.dir/generated/gemm/80/i168256andgemm_b1/cutlass_tensorop_i168256andgemm_b1_256x128_512x3_tn_align128.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_tn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i168256andgemm_b1_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256xorgemm_b1_objs.dir/generated/gemm/80/i168256xorgemm_b1/all_sm80_i168256xorgemm_b1_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_h16832spgemm_objs.dir/generated/gemm/80/h16832spgemm/cutlass_tensorop_h16832spgemm_64x128_64x6_tt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i168256xorgemm_b1_objs.dir/generated/gemm/80/i168256xorgemm_b1/cutlass_tensorop_i168256xorgemm_b1_256x128_512x3_tn_align128.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_s8_objs.dir/generated/gemm/80/i16832gemm_s8/all_sm80_i16832gemm_s8_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_h16832spgemm_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_u8_objs.dir/generated/gemm/80/i16832gemm_u8/all_sm80_i16832gemm_u8_gemm_operations.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_s8_objs.dir/generated/gemm/80/i16832gemm_s8/cutlass_tensorop_i16832gemm_s8_256x128_64x3_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16832gemm_u8_objs.dir/generated/gemm/80/i16832gemm_u8/cutlass_tensorop_i16832gemm_u8_256x128_64x3_tn_align16.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16832gemm_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_s4_objs.dir/generated/gemm/80/i16864gemm_s4/all_sm80_i16864gemm_s4_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16832gemm_u8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_u4_objs.dir/generated/gemm/80/i16864gemm_u4/all_sm80_i16864gemm_u4_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_s4_objs.dir/generated/gemm/80/i16864gemm_s4/cutlass_tensorop_i16864gemm_s4_256x128_128x3_tn_align32.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864gemm_u4_objs.dir/generated/gemm/80/i16864gemm_u4/cutlass_tensorop_i16864gemm_u4_256x128_128x3_tn_align32.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16864gemm_s4_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864spgemm_s8_objs.dir/generated/gemm/80/i16864spgemm_s8/all_sm80_i16864spgemm_s8_gemm_operations.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16864gemm_u4_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/all_sm80_s16816gemm_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_i16864spgemm_s8_objs.dir/generated/gemm/80/i16864spgemm_s8/cutlass_tensorop_i16864spgemm_s8_128x64_128x3_tn_align16.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_nn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/s16816gemm_bf16_s8/all_sm80_s16816gemm_bf16_s8_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs.dir/generated/gemm/80/s16816gemm_bf16_s8/cutlass_tensorop_s16816gemm_bf16_s8_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_tn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/s16816gemm_bf16_u8/all_sm80_s16816gemm_bf16_u8_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs.dir/generated/gemm/80/s16816gemm_bf16_u8/cutlass_tensorop_s16816gemm_bf16_u8_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_bf16_objs.dir/generated/gemm/80/s16816gemm_bf16/cutlass_tensorop_s16816gemm_bf16_256x128_32x3_tt_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/all_sm80_s16816gemm_f16_gemm_operations.cu.o [ 17%] Built target 
cutlass_library_gemm_sm80_s16816gemm_bf16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs.dir/generated/gemm/80/s16816gemm_f16_s8/all_sm80_s16816gemm_f16_s8_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs.dir/generated/gemm/80/s16816gemm_f16_s8/cutlass_tensorop_s16816gemm_f16_s8_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_nt_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs.dir/generated/gemm/80/s16816gemm_f16_u8/all_sm80_s16816gemm_f16_u8_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs.dir/generated/gemm/80/s16816gemm_f16_u8/cutlass_tensorop_s16816gemm_f16_u8_128x128_64x4_tn_align16.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_tn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/all_sm80_s16816gemm_grouped_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_f16_objs.dir/generated/gemm/80/s16816gemm_f16/cutlass_tensorop_s16816gemm_f16_256x128_32x3_tt_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/all_sm80_s16816gemm_grouped_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_nn_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_nt_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs.dir/generated/gemm/80/s16816gemm_grouped_bf16/cutlass_tensorop_s16816gemm_grouped_bf16_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_tn_align8_scheduleDevice.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_objs [ 17%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/all_sm80_s16816gemm_planar_complex_array_bf16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs.dir/generated/gemm/80/s16816gemm_grouped_f16/cutlass_tensorop_s16816gemm_grouped_f16_256x128_32x3_tt_align8_scheduleDevice.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_cn_align8.cu.o [ 17%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16_objs [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/all_sm80_s16816gemm_planar_complex_array_f16_gemm_operations.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nc_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_cn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_cc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_cc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nt_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_nh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ct_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ch_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_nh_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ch_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hn_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hn_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tc_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_tt_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hc_align8.cu.o [ 17%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_ht_align8.cu.o [ 17%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_tt_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_th_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_ht_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_bf16/cutlass_tensorop_s16816gemm_planar_complex_array_bf16_64x128_32x3_hh_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_th_align8.cu.o [ 18%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_objs [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/all_sm80_s16816gemm_planar_complex_bf16_gemm_operations.cu.o [ 18%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_array_f16/cutlass_tensorop_s16816gemm_planar_complex_array_f16_64x128_32x3_hh_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nn_align8.cu.o [ 18%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_objs [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/all_sm80_s16816gemm_planar_complex_f16_gemm_operations.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_cn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_cn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_cc_align8.cu.o [ 
18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nt_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_cc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ct_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nt_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_nh_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ct_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ch_align8.cu.o [ 18%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_nh_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ch_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tc_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hn_align8.cu.o [ 18%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hc_align8.cu.o [ 18%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tc_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_tt_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hc_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_ht_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_tt_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_th_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_ht_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_bf16/cutlass_tensorop_s16816gemm_planar_complex_bf16_64x128_32x3_hh_align8.cu.o [ 19%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_th_align8.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/s16816gemm_s8_bf16/all_sm80_s16816gemm_s8_bf16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs.dir/generated/gemm/80/s16816gemm_planar_complex_f16/cutlass_tensorop_s16816gemm_planar_complex_f16_64x128_32x3_hh_align8.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs.dir/generated/gemm/80/s16816gemm_s8_bf16/cutlass_tensorop_s16816gemm_s8_bf16_128x128_64x4_tn_align16.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs.dir/generated/gemm/80/s16816gemm_s8_f16/all_sm80_s16816gemm_s8_f16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs.dir/generated/gemm/80/s16816gemm_s8_f16/cutlass_tensorop_s16816gemm_s8_f16_128x128_64x4_tn_align16.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16_objs [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/s16816gemm_u8_bf16/all_sm80_s16816gemm_u8_bf16_gemm_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs.dir/generated/gemm/80/s16816gemm_u8_bf16/cutlass_tensorop_s16816gemm_u8_bf16_128x128_64x4_tn_align16.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16_objs [ 19%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cfprop_optimized_cf32/all_sm75_cf32_cfprop_optimized_cf32_conv2d_operations.cu.o [ 19%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cfprop_optimized_cf32/cutlass_simt_cf32_cfprop_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 19%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs.dir/generated/gemm/80/s16816gemm_u8_f16/all_sm80_s16816gemm_u8_f16_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs.dir/generated/gemm/80/s16816gemm_u8_f16/cutlass_tensorop_s16816gemm_u8_f16_128x128_64x4_tn_align16.cu.o [ 20%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/all_sm80_s16816tf32spgemm_gemm_operations.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/all_sm80_s16832spgemm_bf16_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_nn_align8.cu.o [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_nt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_tn_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16816tf32spgemm_objs.dir/generated/gemm/80/s16816tf32spgemm/cutlass_tensorop_s16816tf32spgemm_128x64_32x3_tt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_bf16_objs.dir/generated/gemm/80/s16832spgemm_bf16/cutlass_tensorop_s16832spgemm_bf16_64x128_64x6_tt_align8.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/all_sm80_s16832spgemm_f16_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_nn_align8.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/all_sm80_s1688bf16gemm_gemm_operations.cu.o [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_nt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_tn_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s16832spgemm_f16_objs.dir/generated/gemm/80/s16832spgemm_f16/cutlass_tensorop_s16832spgemm_f16_64x128_64x6_tt_align8.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688bf16gemm_objs.dir/generated/gemm/80/s1688bf16gemm/cutlass_tensorop_s1688bf16gemm_256x128_16x3_tt_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16_objs [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/all_sm80_s1688f16gemm_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_nn_align4.cu.o [ 20%] Built target cutlass_library_gemm_sm80_s1688bf16gemm_objs [ 20%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/all_sm80_s1688gemm_gemm_operations.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_nn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_nt_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_tn_align4.cu.o [ 20%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_tn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688f16gemm_objs.dir/generated/gemm/80/s1688f16gemm/cutlass_tensorop_s1688f16gemm_256x128_16x3_tt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_objs.dir/generated/gemm/80/s1688gemm/cutlass_tensorop_s1688gemm_128x128_16x4_tt_align4.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688f16gemm_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/all_sm80_s1688gemm_tf32_gemm_operations.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688gemm_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/all_sm80_s1688tf32gemm_gemm_operations.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_nn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_nn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_nt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_nt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_tn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_tn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688gemm_tf32_objs.dir/generated/gemm/80/s1688gemm_tf32/cutlass_tensorop_s1688gemm_tf32_256x128_16x3_tt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s1688tf32gemm_objs.dir/generated/gemm/80/s1688tf32gemm/cutlass_tensorop_s1688tf32gemm_256x128_16x3_tt_align4.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs.dir/generated/gemm/80/s4_i168128spgemm_s4/all_sm80_s4_i168128spgemm_s4_gemm_operations.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s1688tf32gemm_objs [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/all_sm80_s4_i16864gemm_s4_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs.dir/generated/gemm/80/s4_i168128spgemm_s4/cutlass_tensorop_s4_i168128spgemm_s4_64x64_256x4_tn_align32.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/cutlass_tensorop_s4_i16864gemm_s4_256x128_128x3_tn_align32.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/all_sm80_s8_i16832gemm_s8_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/cutlass_tensorop_s8_i16832gemm_s8_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs.dir/generated/gemm/80/s4_i16864gemm_s4/cutlass_tensorop_s4_i16864gemm_s4_256x128_128x3_n64t64_align32.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs.dir/generated/gemm/80/s8_i16864spgemm_s8/all_sm80_s8_i16864spgemm_s8_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs.dir/generated/gemm/80/s8_i16832gemm_s8/cutlass_tensorop_s8_i16832gemm_s8_256x128_64x3_n32t32_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs.dir/generated/gemm/80/s8_i16864spgemm_s8/cutlass_tensorop_s8_i16864spgemm_s8_128x64_128x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8_objs [ 21%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/all_sm80_sgemm_gemm_operations.cu.o [ 21%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/all_sm80_tf32_s1688gemm_tf32_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_nn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_nt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_nt_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_tn_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs.dir/generated/gemm/80/tf32_s1688gemm_tf32/cutlass_tensorop_tf32_s1688gemm_tf32_256x128_16x3_tt_align4.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_sgemm_objs.dir/generated/gemm/80/sgemm/cutlass_simt_sgemm_256x128_8x5_tt_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_objs [ 21%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/all_sm80_u4_i16864gemm_u4_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/cutlass_tensorop_u4_i16864gemm_u4_256x128_128x3_tn_align32.cu.o [ 21%] Built target cutlass_library_gemm_sm80_sgemm_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/all_sm80_u8_i16832gemm_u8_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs.dir/generated/gemm/80/u4_i16864gemm_u4/cutlass_tensorop_u4_i16864gemm_u4_256x128_128x3_n64t64_align32.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/cutlass_tensorop_u8_i16832gemm_u8_256x128_64x3_tn_align16.cu.o [ 21%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/all_sm80_z884gemm_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs.dir/generated/gemm/80/u8_i16832gemm_u8/cutlass_tensorop_u8_i16832gemm_u8_256x128_64x3_n32t32_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nn_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3/all_sm89_s16832fastaccumgemm_e4m3_gemm_operations.cu.o [ 21%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_cn_align1.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3/cutlass_tensorop_s16832fastaccumgemm_e4m3_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nc_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3_e5m2/all_sm89_s16832fastaccumgemm_e4m3_e5m2_gemm_operations.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e4m3_e5m2/cutlass_tensorop_s16832fastaccumgemm_e4m3_e5m2_256x128_64x3_tn_align16.cu.o [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_cc_align1.cu.o [ 21%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_objs [ 21%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nt_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ct_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_nh_align1.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ch_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tn_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hn_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tc_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hc_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_tt_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_ht_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_th_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm80_z884gemm_objs.dir/generated/gemm/80/z884gemm/cutlass_tensorop_z884gemm_128x64_8x3_hh_align1.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2/all_sm89_s16832fastaccumgemm_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2/cutlass_tensorop_s16832fastaccumgemm_e5m2_256x128_64x3_tn_align16.cu.o [ 
22%] Built target cutlass_library_gemm_sm80_z884gemm_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2_e4m3/all_sm89_s16832fastaccumgemm_e5m2_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832fastaccumgemm_e5m2_e4m3/cutlass_tensorop_s16832fastaccumgemm_e5m2_e4m3_256x128_64x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_objs.dir/generated/gemm/89/s16832gemm_e4m3/all_sm89_s16832gemm_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_objs.dir/generated/gemm/89/s16832gemm_e4m3/cutlass_tensorop_s16832gemm_e4m3_256x128_64x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832gemm_e4m3_e5m2/all_sm89_s16832gemm_e4m3_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16832gemm_e4m3_e5m2/cutlass_tensorop_s16832gemm_e4m3_e5m2_256x128_64x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_objs.dir/generated/gemm/89/s16832gemm_e5m2/all_sm89_s16832gemm_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_objs.dir/generated/gemm/89/s16832gemm_e5m2/cutlass_tensorop_s16832gemm_e5m2_256x128_64x3_tn_align16.cu.o [ 22%] Built target 
cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832gemm_e5m2_e4m3/all_sm89_s16832gemm_e5m2_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16832gemm_e5m2_e4m3/cutlass_tensorop_s16832gemm_e5m2_e4m3_256x128_64x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3/all_sm89_s16864fastaccumspgemm_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3/cutlass_tensorop_s16864fastaccumspgemm_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3_e5m2/all_sm89_s16864fastaccumspgemm_e4m3_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e4m3_e5m2/cutlass_tensorop_s16864fastaccumspgemm_e4m3_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2/all_sm89_s16864fastaccumspgemm_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2/cutlass_tensorop_s16864fastaccumspgemm_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2_e4m3/all_sm89_s16864fastaccumspgemm_e5m2_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864fastaccumspgemm_e5m2_e4m3/cutlass_tensorop_s16864fastaccumspgemm_e5m2_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e4m3/all_sm89_s16864spgemm_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e4m3/cutlass_tensorop_s16864spgemm_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e4m3_e5m2/all_sm89_s16864spgemm_e4m3_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e4m3_e5m2/cutlass_tensorop_s16864spgemm_e4m3_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_objs [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e5m2/all_sm89_s16864spgemm_e5m2_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs.dir/generated/gemm/89/s16864spgemm_e5m2/cutlass_tensorop_s16864spgemm_e5m2_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e5m2_e4m3/all_sm89_s16864spgemm_e5m2_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs.dir/generated/gemm/89/s16864spgemm_e5m2_e4m3/cutlass_tensorop_s16864spgemm_e5m2_e4m3_128x64_128x3_tn_align16.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/all_sm90_bf16_s64x128x16gemm_bf16_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8.cu.o [ 22%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_objs [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/all_sm90_bf16_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) 
[with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) 
[with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int 
StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 
4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> 
> > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, 
cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 22%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 22%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 23%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = 
cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > 
>, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const 
[with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; 
CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const 
[with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, 
cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) 
[with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; 
ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, 
long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > 
>'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 23%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ 
= cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, 
cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> 
>, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; 
StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : 
Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > 
> >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, 
cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, 
cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required 
from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 
24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 24%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/bf16_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_bf16_bf16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 25%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_objs [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/all_sm90_bf16_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 25%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 25%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_objs [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/all_sm90_bf16_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 26%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 26%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, 
long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has 
no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct 
TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, 
cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long 
int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does 
not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int 
StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long 
int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, 
cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static 
constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, 
long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const 
[with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, 
void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type 
= void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, 
void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > 
> >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, 
StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, 
cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > 
>, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 27%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, 
long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; 
StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long 
int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > 
> >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int 
StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, 
cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, 
cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using 
TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 28%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, 
cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const 
cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_objs [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/all_sm90_bf16_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 29%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_objs [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/all_sm90_d1684gemm_gemm_operations.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_nnn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_ntn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_tnn_align1.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_d1684gemm_objs.dir/generated/gemm/90/d1684gemm/cutlass_sm90_tensorop_d1684gemm_f64_f64_f64_f64_f64_128x128x16_1x1x1_3_ttn_align1.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 29%] Built target cutlass_library_gemm_sm90_d1684gemm_objs [ 29%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/all_sm90_f16_s64x128x16gemm_f16_gemm_operations.cu.o [ 29%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 29%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static 
constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const 
Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status 
cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 30%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 30%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; 
CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static 
constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required 
from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required 
from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, 
void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: 
note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, 
long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 
3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, 
long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > 
> >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) 
[with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required 
from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::bfloat16_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 31%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_bf16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/bf16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_bf16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_objs [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/all_sm90_f16_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs.dir/generated/gemm/90/f16_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f16_f16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_objs [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/all_sm90_f16_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 
32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ 
= cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = 
cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: 
In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, 
cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > 
>, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, 
cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 32%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 33%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; 
CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ 
= 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, 
cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; 
StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : 
Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = 
cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 33%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; 
bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, 
long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long 
int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 
'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, 
cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> 
>, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple 
>, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = 
void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 34%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ 
= cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 34%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_objs [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/all_sm90_f16_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_objs [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/all_sm90_f16_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 35%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 35%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 
36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ 
= cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = 
cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) 
const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = 
cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, 
cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, 
cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, 
cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, 
cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, 
cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; 
ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, 
cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, 
cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined 
constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int 
StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, 
cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, 
cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, 
cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, 
cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with 
Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 36%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool 
ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, 
long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long 
int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom 
^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 37%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, 
cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const 
void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_f16_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_objs [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/all_sm90_gz1684gemm_gemm_operations.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_nnn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_cnn_align1.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ncn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ccn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/f16_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f16_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ntn_align1.cu.o [ 38%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_objs [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/all_sm90_h64x128x16gemm_gemm_operations.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ctn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_nhn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_chn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_tnn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hnn_align1.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_tcn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hcn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_ttn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_htn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_thn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_gz1684gemm_objs.dir/generated/gemm/90/gz1684gemm/cutlass_sm90_tensorop_gz1684gemm_cf64_cf64_cf64_cf64_cf64_64x64x8_1x1x1_3_hhn_align1.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8.cu.o [ 38%] Built target cutlass_library_gemm_sm90_gz1684gemm_objs [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/all_sm90_i64x128x32gemm_s8_gemm_operations.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long 
int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : 
Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs.dir/generated/gemm/90/i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8_objs [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/all_sm90_i64x128x32gemm_u8_gemm_operations.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, 
cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = 
int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required 
from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with 
Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = int; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, int, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, 
cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs.dir/generated/gemm/90/i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s32_s32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 38%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8_objs [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 38%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ 
= cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 
'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t 
= CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, 
long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > 
> >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ 
= 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided 
default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, 
ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = 
cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > 
>'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_h64x128x16gemm_objs.dir/generated/gemm/90/h64x128x16gemm/cutlass3x_sm90_tensorop_h64x128x16gemm_f16_f16_f16_f16_f16_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/all_sm90_s64x128x16gemm_bf16_gemm_operations.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o [ 39%] Built target cutlass_library_gemm_sm90_h64x128x16gemm_objs [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/all_sm90_s64x128x16gemm_f16_gemm_operations.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 39%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8.cu.o [ 40%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_nosmem.cu.o [ 40%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, 
long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, 
cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' 
cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; 
CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> 
>, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ 
= cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params 
cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, 
cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = 
true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > 
>; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ 
= cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 41%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 42%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_tnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ttn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_nnn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_objs [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/all_sm90_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 43%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs.dir/generated/gemm/90/s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_f32_128x128x64_1x1x1_0_ntn_align2_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16_objs [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/all_sm90_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 43%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool 
ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; 
StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, 
long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor 
struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor 
does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 44%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 44%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does 
not initialize 'cuuint64_t CUtensorMap_st::opaque [16]'
 cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS];
 ^
[ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o
[ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o
[ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o
[ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = 
cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; 
CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 45%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does 
not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_objs [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/all_sm90_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_objs [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/all_sm90_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, 
cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> 
> > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 
'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool 
DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 
0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = 
cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
>, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, 
cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy 
: Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; 
CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 46%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 46%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 47%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, 
cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool 
ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; 
StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, 
long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, 
void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, 
cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor 
struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor 
does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o
[ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o
[ 48%] Building CUDA object
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; 
CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does 
not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, 
SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' 
{aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = 
cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, 
cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; 
CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> 
> > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 2; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; 
SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_64x128x128_1x2x1_0_tnn_align16_warpspecialized_pingpong_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x2x1_0_tnn_align16_stream_k_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does 
not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 48%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 48%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_f32_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e4m3_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 49%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 49%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_objs [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/all_sm90_s64x128x8gemm_tf32_gemm_operations.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_f32_e5m2_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_objs [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/all_sm90_s64x128x8tf32gemm_gemm_operations.cu.o [ 50%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = 
cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; 
ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, 
cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; 
CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, 
cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool 
DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; 
bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, 
cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> 
>, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 
0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long 
int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long 
int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default 
constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; 
int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = 
cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 
3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > 
>, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_pingpong_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, 
cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, cutlass::tfloat32_t, cute::tuple, long int, long int>, cutlass::tfloat32_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, cutlass::tfloat32_t>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<32> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cute::TiledMMA >, 
cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<32> >, float, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, cute::Copy_Atom, float>, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<8> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> >, float, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<32> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, 
cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_tnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_nnn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_ntn_align4_warpspecialized_cooperative_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_64x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_64x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_2x1x1_0_ttn_align4_warpspecialized_pingpong.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_256x128x32_1x2x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_256x128x32_1x2x1_0_ttn_align4_warpspecialized_cooperative.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 50%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align2_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_tnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_nnn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 51%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align2_cpasync_warpspecialized.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ntn_align1_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs.dir/generated/gemm/90/s64x128x8gemm_tf32/cutlass3x_sm90_tensorop_s64x128x8gemm_tf32_tf32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align1_cpasync_warpspecialized.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align2_cpasync_warpspecialized.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/all_sm90_s8_i64x128x32gemm_s8_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs.dir/generated/gemm/90/s64x128x8tf32gemm/cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_f32_128x128x32_1x1x1_0_ttn_align1_cpasync_warpspecialized.cu.o [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/all_sm90_s8_i64x128x32gemm_u8_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, 
ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 32; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<64> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<64> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<64>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<2, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<64> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, 
void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, 
cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = 
unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, 
long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: 
note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = signed char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, signed char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, signed char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = signed char; StrideC_ = cute::tuple, long int, long int>; ElementD_ = unsigned char; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> 
> >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, signed char, cute::tuple, long int, long int>, unsigned char, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, unsigned char>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_epi_nosmem.cu.o [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_s8_s8_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align8_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/all_sm90_void_i64x128x32gemm_s8_gemm_operations.cu.o [ 52%] 
Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs.dir/generated/gemm/90/s8_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_s8_u8_128x128x128_1x1x1_0_tnn_align4_stream_k_cpasync_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/all_sm90_void_i64x128x32gemm_u8_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = 
cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, 
cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_pingpong_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ 
= 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, 
cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, 
const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = 
cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : 
Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; 
SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, signed char, cute::tuple, long int>, signed char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs.dir/generated/gemm/90/void_i64x128x32gemm_s8/cutlass3x_sm90_tensorop_i64x128x32gemm_s8_s8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = int; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, unsigned char, cute::tuple, long int>, unsigned char, cute::tuple, long int>, cute::TiledMMA, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, int, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, int>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/all_sm90_void_s64x128x16gemm_bf16_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs.dir/generated/gemm/90/void_i64x128x32gemm_u8/cutlass3x_sm90_tensorop_i64x128x32gemm_u8_u8_s32_void_s32_128x128x128_2x1x1_0_tnn_align16_stream_k_warpspecialized_cooperative_epi_nosmem.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o [ 52%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_objs [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/all_sm90_void_s64x128x16gemm_f16_gemm_operations.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; 
CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename 
std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const 
ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, 
long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long 
int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, 
cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, 
cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, 
cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; 
^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ 
= cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: 
required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, 
cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, 
cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, 
SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> 
>, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const 
cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > 
> >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<64>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool 
ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; 
CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, 
cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 
1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long 
int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, 
cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no 
user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 
'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; 
CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; 
TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ 
= cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ 
= cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> 
>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = float; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f32_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, float, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<32> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, 
cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, float>, cute::Layout, cute::tuple, cute::C<32>, cute::C<4> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<32> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 
'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; 
int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, 
cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 
'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In 
instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, 
void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t 
= CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 52%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, 
cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, 
cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 
3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, 
CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' 
/builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with 
ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, 
void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, 
cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct 
cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = 
true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> 
>, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp:212:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_pingpong_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedPingpong>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 
'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with 
GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = 
cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, 
CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, 
cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 
16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_nnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, 
cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined 
constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ntn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' 
/builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs.dir/generated/gemm/90/void_s64x128x16gemm_bf16/cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o 
/builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 
'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = 
cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs.dir/generated/gemm/90/void_s64x128x16gemm_f16/cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::bfloat16_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, 
cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_bf16_bf16_f32_void_bf16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, 
cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::bfloat16_t, cute::tuple, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::bfloat16_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, 
cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::bfloat16_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/all_sm90_void_s64x128x32gemm_e4m3_gemm_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<64> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpS2R_ = cute::SM75_U16x8_LDSM_T; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >; CopyOpR2S_ = cute::SM90_U16x8_STSM_T]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<1>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>; TileScheduler_ = cutlass::gemm::StreamKScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_void_f16_128x128x64_2x1x1_0_ttn_align8_stream_k_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<1>, 
cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<64> >, cutlass::half_t, cute::tuple, long int>, cutlass::half_t, cute::tuple, long int, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::half_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<64> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM75_U16x8_LDSM_T, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<64> > > >, cute::SM90_U16x8_STSM_T>, cutlass::gemm::StreamKScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, 
cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<32>, cute::C<2> > > >, cute::tuple, cute::tuple, cute::C<128>, cute::C<64> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/all_sm90_void_s64x128x32gemm_e4m3_e5m2_gemm_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = 
cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; 
cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here 
/builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long 
int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, 
cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, 
cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, 
const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, 
FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 
3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int 
StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, 
cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, 
cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; 
EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, 
cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, 
cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long 
int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long 
int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e4m3_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = 
cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/all_sm90_void_s64x128x32gemm_e5m2_gemm_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, 
cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e4m3_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e4m3_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, 
cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor 
struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/all_sm90_void_s64x128x32gemm_e5m2_e4m3_gemm_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, 
long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> 
>, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, 
cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, 
long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the 
implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t 
opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e5m2_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e5m2_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e5m2_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e5m2_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > 
>, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, 
cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperative>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, 
cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs.dir/generated/gemm/90/void_s64x128x32gemm_e5m2_e4m3/cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = 
cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e5m2_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e5m2_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, 
cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_objs [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/all_sm90_z1684gemm_gemm_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_nnn_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_cnn_align1.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple; int StagesC_ = 4; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = false; bool DelayTmaStore_ = true; CtaTileMNK_ = cute::tuple, cute::C<128>, cute::C<128> >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = void; StrideC_ = cute::tuple, long int, long int>; ElementD_ = cutlass::float_e4m3_t; StrideD_ = cute::tuple, long int, long int>; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, 
cute::C<128> >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:211:184: required from 'static cutlass::gemm::kernel::GemmUniversal, void>::type>::Params cutlass::gemm::kernel::GemmUniversal, void>::type>::to_underlying_arguments(const cutlass::gemm::kernel::GemmUniversal, void>::type>::Arguments&, void*) [with ProblemShape_ = cute::tuple; CollectiveMainloop_ = cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, 
cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileScheduler_ = cutlass::gemm::PersistentScheduler]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:292:48: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:429:17: required from 'cutlass::Status cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with GemmKernel_ = cutlass3x_sm90_tensorop_s64x128x32gemm_e5m2_e4m3_f32_void_e4m3_256x128x128_1x2x1_0_tnn_align16_warpspecialized_cooperative_fp8_fastaccum_epi_tma; typename std::enable_if::value, void>::type = void; cutlass::gemm::device::GemmUniversalAdapter::value, void>::type>::Arguments = cutlass::gemm::kernel::GemmUniversal, cutlass::gemm::collective::CollectiveMma, cute::C<2>, cute::C<1> >, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>, cute::tuple, cute::C<128>, cute::C<128> >, cutlass::float_e5m2_t, cute::tuple, long int>, cutlass::float_e4m3_t, cute::tuple, long int>, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<128> >, cute::tuple, cute::C<1> > > >, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> >, void, cute::tuple, long int, long int>, 
cutlass::float_e4m3_t, cute::tuple, long int, long int>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<128>, cute::C<128> >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE, cute::ComposedLayout, cute::smem_ptr_flag_bits<8>, cute::Layout, cute::C<8> >, cute::tuple, cute::C<128> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, cutlass::gemm::PersistentScheduler>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:340:8: required from 'cutlass::Status cutlass::library::GemmUniversal3xOperation::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::gemm::device::GemmUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/gemm_operation_3x.hpp:326:17: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::AuxTmaParams, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> >, const cute::Layout, cute::C<32>, cute::C<1> >, cute::tuple, 0>, cute::ScaledBasis, 1>, cute::ScaledBasis, 2> > >&, const cute::Swizzle<3, 4, 3>&> >, cutlass::float_e4m3_t>, cute::Layout, cute::tuple > >, cute::tuple, cute::tuple > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 53%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/all_sm50_cf32_cdgrad_optimized_cf32_conv2d_operations.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x64_8x2_nhwc_unity_stride_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ncn_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ccn_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ntn_align1.cu.o [ 53%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_objs [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cfprop_optimized_cf32/all_sm50_cf32_cfprop_optimized_cf32_conv2d_operations.cu.o [ 53%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ctn_align1.cu.o [ 53%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cfprop_optimized_cf32/cutlass_simt_cf32_cfprop_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_nhn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cwgrad_optimized_cf32/all_sm50_cf32_cwgrad_optimized_cf32_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/50/cf32_cwgrad_optimized_cf32/cutlass_simt_cf32_cwgrad_optimized_cf32_128x64_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_chn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_tnn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/all_sm50_sdgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/cutlass_simt_sdgrad_optimized_128x128_8x2_nhwc_unity_stride_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hnn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sdgrad_optimized_objs.dir/generated/conv2d/50/sdgrad_optimized/cutlass_simt_sdgrad_optimized_128x128_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_tcn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sfprop_optimized_objs.dir/generated/conv2d/50/sfprop_optimized/all_sm50_sfprop_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_sfprop_optimized_objs.dir/generated/conv2d/50/sfprop_optimized/cutlass_simt_sfprop_optimized_128x128_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hcn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_ttn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_sfprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_swgrad_optimized_objs.dir/generated/conv2d/50/swgrad_optimized/all_sm50_swgrad_optimized_conv2d_operations.cu.o [ 54%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm50_swgrad_optimized_objs.dir/generated/conv2d/50/swgrad_optimized/cutlass_simt_swgrad_optimized_128x128_8x2_nhwc_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_htn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm50_swgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm60_hfprop_optimized_objs.dir/generated/conv2d/60/hfprop_optimized/all_sm60_hfprop_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm60_hfprop_optimized_objs.dir/generated/conv2d/60/hfprop_optimized/cutlass_simt_hfprop_optimized_64x32x9_1x8x8x32_3_filter3x3_nhwc_depthwise_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_thn_align1.cu.o [ 54%] Built target cutlass_library_conv2d_sm60_hfprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/all_sm70_f16_s884dgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_gemm_sm90_z1684gemm_objs.dir/generated/gemm/90/z1684gemm/cutlass_sm90_tensorop_z1684gemm_cf64_cf64_cf64_cf64_cf64_128x64x8_1x1x1_3_hhn_align1.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/cutlass_tensorop_f16_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 54%] Built target cutlass_library_gemm_sm90_z1684gemm_objs [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/f16_s884fprop_optimized_f16/all_sm70_f16_s884fprop_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884dgrad_optimized_f16/cutlass_tensorop_f16_s884dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/f16_s884fprop_optimized_f16/cutlass_tensorop_f16_s884fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884wgrad_optimized_f16/all_sm70_f16_s884wgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/all_sm70_h884dgrad_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/f16_s884wgrad_optimized_f16/cutlass_tensorop_f16_s884wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/cutlass_tensorop_h884dgrad_optimized_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_objs [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884fprop_optimized_objs.dir/generated/conv2d/70/h884fprop_optimized/all_sm70_h884fprop_optimized_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884dgrad_optimized_objs.dir/generated/conv2d/70/h884dgrad_optimized/cutlass_tensorop_h884dgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884fprop_optimized_objs.dir/generated/conv2d/70/h884fprop_optimized/cutlass_tensorop_h884fprop_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884wgrad_optimized_objs.dir/generated/conv2d/70/h884wgrad_optimized/all_sm70_h884wgrad_optimized_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/all_sm70_s884dgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_h884wgrad_optimized_objs.dir/generated/conv2d/70/h884wgrad_optimized/cutlass_tensorop_h884wgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/cutlass_tensorop_s884dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/s884fprop_optimized_f16/all_sm70_s884fprop_optimized_f16_conv2d_operations.cu.o [ 54%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs.dir/generated/conv2d/70/s884dgrad_optimized_f16/cutlass_tensorop_s884dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs.dir/generated/conv2d/70/s884fprop_optimized_f16/cutlass_tensorop_s884fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/s884wgrad_optimized_f16/all_sm70_s884wgrad_optimized_f16_conv2d_operations.cu.o [ 54%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16_objs [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/all_sm75_cf32_cdgrad_optimized_cf32_conv2d_operations.cu.o [ 54%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs.dir/generated/conv2d/70/s884wgrad_optimized_f16/cutlass_tensorop_s884wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x128_8x5_nhwc_unity_stride_align1.cu.o [ 55%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cwgrad_optimized_cf32/all_sm75_cf32_cwgrad_optimized_cf32_conv2d_operations.cu.o [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cwgrad_optimized_cf32/cutlass_simt_cf32_cwgrad_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs.dir/generated/conv2d/75/cf32_cdgrad_optimized_cf32/cutlass_simt_cf32_cdgrad_optimized_cf32_128x128_8x5_nhwc_align1.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/all_sm75_f16_s1688dgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_few_channels_f16/all_sm75_f16_s1688fprop_few_channels_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/cutlass_tensorop_f16_s1688dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_few_channels_f16/cutlass_tensorop_f16_s1688fprop_few_channels_f16_128x64_32x2_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688dgrad_optimized_f16/cutlass_tensorop_f16_s1688dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_objs [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_fixed_channels_f16/all_sm75_f16_s1688fprop_fixed_channels_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_fixed_channels_f16/cutlass_tensorop_f16_s1688fprop_fixed_channels_f16_128x64_32x2_nhwc_align4.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_optimized_f16/all_sm75_f16_s1688fprop_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688fprop_optimized_f16/cutlass_tensorop_f16_s1688fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688wgrad_optimized_f16/all_sm75_f16_s1688wgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/f16_s1688wgrad_optimized_f16/cutlass_tensorop_f16_s1688wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/all_sm75_h1688dgrad_optimized_conv2d_operations.cu.o [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/cutlass_tensorop_h1688dgrad_optimized_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs.dir/generated/conv2d/75/h1688fprop_few_channels/all_sm75_h1688fprop_few_channels_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs.dir/generated/conv2d/75/h1688fprop_few_channels/cutlass_tensorop_h1688fprop_few_channels_128x64_32x2_nhwc_align1.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs.dir/generated/conv2d/75/h1688dgrad_optimized/cutlass_tensorop_h1688dgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs.dir/generated/conv2d/75/h1688fprop_fixed_channels/all_sm75_h1688fprop_fixed_channels_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_optimized_objs.dir/generated/conv2d/75/h1688fprop_optimized/all_sm75_h1688fprop_optimized_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs.dir/generated/conv2d/75/h1688fprop_fixed_channels/cutlass_tensorop_h1688fprop_fixed_channels_128x64_32x2_nhwc_align4.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688fprop_optimized_objs.dir/generated/conv2d/75/h1688fprop_optimized/cutlass_tensorop_h1688fprop_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Built target 
cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs.dir/generated/conv2d/75/h1688wgrad_optimized/all_sm75_h1688wgrad_optimized_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/i8816fprop_optimized_s8/all_sm75_i8816fprop_optimized_s8_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs.dir/generated/conv2d/75/h1688wgrad_optimized/cutlass_tensorop_h1688wgrad_optimized_256x128_32x2_nhwc_align8.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/i8816fprop_optimized_s8/cutlass_tensorop_i8816fprop_optimized_s8_256x128_64x2_nhwc_align16.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/i8816fprop_optimized_u8/all_sm75_i8816fprop_optimized_u8_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/i8832fprop_optimized_s4/all_sm75_i8832fprop_optimized_s4_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/i8816fprop_optimized_u8/cutlass_tensorop_i8816fprop_optimized_u8_256x128_64x2_nhwc_align16.cu.o [ 55%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/i8832fprop_optimized_s4/cutlass_tensorop_i8832fprop_optimized_s4_256x128_128x2_nhwc_align32.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/i8832fprop_optimized_u4/all_sm75_i8832fprop_optimized_u4_conv2d_operations.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/all_sm75_s1688dgrad_optimized_f16_conv2d_operations.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/i8832fprop_optimized_u4/cutlass_tensorop_i8832fprop_optimized_u4_256x128_128x2_nhwc_align32.cu.o [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/cutlass_tensorop_s1688dgrad_optimized_f16_256x128_32x2_nhwc_unity_stride_align8.cu.o [ 55%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_objs [ 55%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_few_channels_f16/all_sm75_s1688fprop_few_channels_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688dgrad_optimized_f16/cutlass_tensorop_s1688dgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_few_channels_f16/cutlass_tensorop_s1688fprop_few_channels_f16_128x64_32x2_nhwc_align1.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_fixed_channels_f16/all_sm75_s1688fprop_fixed_channels_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs.dir/generated/conv2d/75/s1688fprop_fixed_channels_f16/cutlass_tensorop_s1688fprop_fixed_channels_f16_128x64_32x2_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/s1688fprop_optimized_f16/all_sm75_s1688fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs.dir/generated/conv2d/75/s1688fprop_optimized_f16/cutlass_tensorop_s1688fprop_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688wgrad_optimized_f16/all_sm75_s1688wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs.dir/generated/conv2d/75/s1688wgrad_optimized_f16/cutlass_tensorop_s1688wgrad_optimized_f16_256x128_32x2_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_objs [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/all_sm75_s4_i8832fprop_optimized_s4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/cutlass_tensorop_s4_i8832fprop_optimized_s4_256x128_128x2_nhwc_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_few_channels_s8/all_sm75_s8_i8816fprop_few_channels_s8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_few_channels_s8/cutlass_tensorop_s8_i8816fprop_few_channels_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs.dir/generated/conv2d/75/s4_i8832fprop_optimized_s4/cutlass_tensorop_s4_i8832fprop_optimized_s4_256x128_128x2_nc64hw64_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_fixed_channels_s8/all_sm75_s8_i8816fprop_fixed_channels_s8_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/all_sm75_s8_i8816fprop_optimized_s8_conv2d_operations.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_fixed_channels_s8/cutlass_tensorop_s8_i8816fprop_fixed_channels_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/cutlass_tensorop_s8_i8816fprop_optimized_s8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs.dir/generated/conv2d/75/s8_i8816fprop_optimized_s8/cutlass_tensorop_s8_i8816fprop_optimized_s8_256x128_64x2_nc32hw32_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/all_sm75_u4_i8832fprop_optimized_u4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/cutlass_tensorop_u4_i8832fprop_optimized_u4_256x128_128x2_nhwc_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_few_channels_u8/all_sm75_u8_i8816fprop_few_channels_u8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_few_channels_u8/cutlass_tensorop_u8_i8816fprop_few_channels_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs.dir/generated/conv2d/75/u4_i8832fprop_optimized_u4/cutlass_tensorop_u4_i8832fprop_optimized_u4_256x128_128x2_nc64hw64_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_fixed_channels_u8/all_sm75_u8_i8816fprop_fixed_channels_u8_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/all_sm75_u8_i8816fprop_optimized_u8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_fixed_channels_u8/cutlass_tensorop_u8_i8816fprop_fixed_channels_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/cutlass_tensorop_u8_i8816fprop_optimized_u8_256x128_64x2_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/all_sm80_bf16_s16816dgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs.dir/generated/conv2d/75/u8_i8816fprop_optimized_u8/cutlass_tensorop_u8_i8816fprop_optimized_u8_256x128_64x2_nc32hw32_align16.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_fixed_channels_bf16/all_sm80_bf16_s16816fprop_fixed_channels_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816dgrad_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_fixed_channels_bf16/cutlass_tensorop_bf16_s16816fprop_fixed_channels_bf16_256x128_32x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/all_sm80_bf16_s16816fprop_optimized_bf16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816wgrad_optimized_bf16/all_sm80_bf16_s16816wgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/cutlass_tensorop_bf16_s16816fprop_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 
56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816wgrad_optimized_bf16/cutlass_tensorop_bf16_s16816wgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/bf16_s16816fprop_optimized_bf16/cutlass_tensorop_bf16_s16816fprop_optimized_bf16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/all_sm80_f16_s16816dgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_fixed_channels_f16/all_sm80_f16_s16816fprop_fixed_channels_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_fixed_channels_f16/cutlass_tensorop_f16_s16816fprop_fixed_channels_f16_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816dgrad_optimized_f16/cutlass_tensorop_f16_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target 
cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/all_sm80_f16_s16816fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816wgrad_optimized_f16/all_sm80_f16_s16816wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/cutlass_tensorop_f16_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816wgrad_optimized_f16/cutlass_tensorop_f16_s16816wgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/all_sm80_h16816dgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/f16_s16816fprop_optimized_f16/cutlass_tensorop_f16_s16816fprop_optimized_f16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/cutlass_tensorop_h16816dgrad_optimized_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_objs [ 56%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs.dir/generated/conv2d/80/h16816fprop_fixed_channels/all_sm80_h16816fprop_fixed_channels_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs.dir/generated/conv2d/80/h16816dgrad_optimized/cutlass_tensorop_h16816dgrad_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs.dir/generated/conv2d/80/h16816fprop_fixed_channels/cutlass_tensorop_h16816fprop_fixed_channels_256x128_32x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/all_sm80_h16816fprop_optimized_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs.dir/generated/conv2d/80/h16816wgrad_optimized/all_sm80_h16816wgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/cutlass_tensorop_h16816fprop_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs.dir/generated/conv2d/80/h16816wgrad_optimized/cutlass_tensorop_h16816wgrad_optimized_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/all_sm80_i16832fprop_optimized_s8_conv2d_operations.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_h16816fprop_optimized_objs.dir/generated/conv2d/80/h16816fprop_optimized/cutlass_tensorop_h16816fprop_optimized_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/all_sm80_i16832fprop_optimized_u8_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_single_group_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/cutlass_tensorop_i16832fprop_optimized_u8_256x128_64x3_nhwc_align16.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/all_sm80_i16864fprop_optimized_s4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/i16832fprop_optimized_u8/cutlass_tensorop_i16832fprop_optimized_u8_256x128_64x3_nhwc_single_group_align16.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/cutlass_tensorop_i16864fprop_optimized_s4_256x128_128x3_nhwc_align32.cu.o [ 56%] Built 
target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/all_sm80_i16864fprop_optimized_u4_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/i16864fprop_optimized_s4/cutlass_tensorop_i16864fprop_optimized_s4_256x128_128x3_nhwc_single_group_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/cutlass_tensorop_i16864fprop_optimized_u4_256x128_128x3_nhwc_align32.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/all_sm80_s16816dgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/i16864fprop_optimized_u4/cutlass_tensorop_i16864fprop_optimized_u4_256x128_128x3_nhwc_single_group_align32.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/cutlass_tensorop_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/all_sm80_s16816dgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_bf16/cutlass_tensorop_s16816dgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/cutlass_tensorop_s16816dgrad_optimized_f16_256x128_32x3_nhwc_unity_stride_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_bf16/all_sm80_s16816fprop_fixed_channels_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816dgrad_optimized_f16/cutlass_tensorop_s16816dgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_bf16/cutlass_tensorop_s16816fprop_fixed_channels_bf16_256x128_32x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_f16/all_sm80_s16816fprop_fixed_channels_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/all_sm80_s16816fprop_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs.dir/generated/conv2d/80/s16816fprop_fixed_channels_f16/cutlass_tensorop_s16816fprop_fixed_channels_f16_256x128_32x3_nhwc_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/cutlass_tensorop_s16816fprop_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/all_sm80_s16816fprop_optimized_f16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs.dir/generated/conv2d/80/s16816fprop_optimized_bf16/cutlass_tensorop_s16816fprop_optimized_bf16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/cutlass_tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_bf16/all_sm80_s16816wgrad_optimized_bf16_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs.dir/generated/conv2d/80/s16816fprop_optimized_f16/cutlass_tensorop_s16816fprop_optimized_f16_256x128_32x3_nhwc_single_group_align8.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_bf16/cutlass_tensorop_s16816wgrad_optimized_bf16_256x128_32x3_nhwc_align8.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_f16/all_sm80_s16816wgrad_optimized_f16_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/all_sm80_s1688bf16dgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs.dir/generated/conv2d/80/s16816wgrad_optimized_f16/cutlass_tensorop_s16816wgrad_optimized_f16_256x128_32x3_nhwc_align8.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/cutlass_tensorop_s1688bf16dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/all_sm80_s1688bf16fprop_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16dgrad_optimized/cutlass_tensorop_s1688bf16dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/cutlass_tensorop_s1688bf16fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16wgrad_optimized/all_sm80_s1688bf16wgrad_optimized_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs.dir/generated/conv2d/80/s1688bf16fprop_optimized/cutlass_tensorop_s1688bf16fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs.dir/generated/conv2d/80/s1688bf16wgrad_optimized/cutlass_tensorop_s1688bf16wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/all_sm80_s1688dgrad_optimized_conv2d_operations.cu.o [ 56%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_objs [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/all_sm80_s1688dgrad_optimized_tf32_conv2d_operations.cu.o [ 56%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/cutlass_tensorop_s1688dgrad_optimized_128x128_16x4_nhwc_unity_stride_align4.cu.o [ 56%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/cutlass_tensorop_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs.dir/generated/conv2d/80/s1688dgrad_optimized/cutlass_tensorop_s1688dgrad_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688dgrad_optimized_tf32/cutlass_tensorop_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/all_sm80_s1688f16dgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/cutlass_tensorop_s1688f16dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/all_sm80_s1688f16fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/cutlass_tensorop_s1688f16fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs.dir/generated/conv2d/80/s1688f16dgrad_optimized/cutlass_tensorop_s1688f16dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs.dir/generated/conv2d/80/s1688f16fprop_optimized/cutlass_tensorop_s1688f16fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs.dir/generated/conv2d/80/s1688f16wgrad_optimized/all_sm80_s1688f16wgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/all_sm80_s1688fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs.dir/generated/conv2d/80/s1688f16wgrad_optimized/cutlass_tensorop_s1688f16wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/cutlass_tensorop_s1688fprop_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/all_sm80_s1688fprop_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_objs.dir/generated/conv2d/80/s1688fprop_optimized/cutlass_tensorop_s1688fprop_optimized_128x128_16x4_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/cutlass_tensorop_s1688fprop_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/all_sm80_s1688tf32dgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/s1688fprop_optimized_tf32/cutlass_tensorop_s1688fprop_optimized_tf32_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/cutlass_tensorop_s1688tf32dgrad_optimized_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/all_sm80_s1688tf32fprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32dgrad_optimized/cutlass_tensorop_s1688tf32dgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/cutlass_tensorop_s1688tf32fprop_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_objs [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32wgrad_optimized/all_sm80_s1688tf32wgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs.dir/generated/conv2d/80/s1688tf32fprop_optimized/cutlass_tensorop_s1688tf32fprop_optimized_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs.dir/generated/conv2d/80/s1688tf32wgrad_optimized/cutlass_tensorop_s1688tf32wgrad_optimized_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs.dir/generated/conv2d/80/s1688wgrad_optimized/all_sm80_s1688wgrad_optimized_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688wgrad_optimized_tf32/all_sm80_s1688wgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs.dir/generated/conv2d/80/s1688wgrad_optimized/cutlass_tensorop_s1688wgrad_optimized_128x128_16x4_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/s1688wgrad_optimized_tf32/cutlass_tensorop_s1688wgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_objs [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/all_sm80_s4_i16864fprop_optimized_s4_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nhwc_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_few_channels_s8/all_sm80_s8_i16832fprop_few_channels_s8_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_few_channels_s8/cutlass_tensorop_s8_i16832fprop_few_channels_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nhwc_single_group_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_fixed_channels_s8/all_sm80_s8_i16832fprop_fixed_channels_s8_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_fixed_channels_s8/cutlass_tensorop_s8_i16832fprop_fixed_channels_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs.dir/generated/conv2d/80/s4_i16864fprop_optimized_s4/cutlass_tensorop_s4_i16864fprop_optimized_s4_256x128_128x3_nc64hw64_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/all_sm80_s8_i16832fprop_optimized_s8_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/all_sm80_sdgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/cutlass_simt_sdgrad_optimized_256x128_8x5_nhwc_unity_stride_align1.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nhwc_single_group_align16.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sdgrad_optimized_objs.dir/generated/conv2d/80/sdgrad_optimized/cutlass_simt_sdgrad_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs.dir/generated/conv2d/80/s8_i16832fprop_optimized_s8/cutlass_tensorop_s8_i16832fprop_optimized_s8_256x128_64x3_nc32hw32_align16.cu.o [ 57%] Built target 
cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sfprop_optimized_objs.dir/generated/conv2d/80/sfprop_optimized/all_sm80_sfprop_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_sfprop_optimized_objs.dir/generated/conv2d/80/sfprop_optimized/cutlass_simt_sfprop_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_swgrad_optimized_objs.dir/generated/conv2d/80/swgrad_optimized/all_sm80_swgrad_optimized_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_swgrad_optimized_objs.dir/generated/conv2d/80/swgrad_optimized/cutlass_simt_swgrad_optimized_256x128_8x5_nhwc_align1.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_sfprop_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/all_sm80_tf32_s1688dgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/cutlass_tensorop_tf32_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_unity_stride_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_swgrad_optimized_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/all_sm80_tf32_s1688fprop_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/cutlass_tensorop_tf32_s1688fprop_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688dgrad_optimized_tf32/cutlass_tensorop_tf32_s1688dgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688fprop_optimized_tf32/cutlass_tensorop_tf32_s1688fprop_optimized_tf32_256x128_16x3_nhwc_single_group_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688wgrad_optimized_tf32/all_sm80_tf32_s1688wgrad_optimized_tf32_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs.dir/generated/conv2d/80/tf32_s1688wgrad_optimized_tf32/cutlass_tensorop_tf32_s1688wgrad_optimized_tf32_256x128_16x3_nhwc_align4.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/all_sm80_u4_i16864fprop_optimized_u4_conv2d_operations.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nhwc_align32.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_objs [ 57%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nhwc_single_group_align32.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs.dir/generated/conv2d/80/u4_i16864fprop_optimized_u4/cutlass_tensorop_u4_i16864fprop_optimized_u4_256x128_128x3_nc64hw64_align32.cu.o [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_few_channels_u8/all_sm80_u8_i16832fprop_few_channels_u8_conv2d_operations.cu.o [ 57%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_objs [ 57%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_fixed_channels_u8/all_sm80_u8_i16832fprop_fixed_channels_u8_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_few_channels_u8/cutlass_tensorop_u8_i16832fprop_few_channels_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_fixed_channels_u8/cutlass_tensorop_u8_i16832fprop_fixed_channels_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/all_sm80_u8_i16832fprop_optimized_u8_conv2d_operations.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_objs [ 58%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e4m3/all_sm89_s16832fprop_fixed_channels_e4m3_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e4m3/cutlass_tensorop_s16832fprop_fixed_channels_e4m3_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nhwc_single_group_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e5m2/all_sm89_s16832fprop_fixed_channels_e5m2_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs.dir/generated/conv2d/89/s16832fprop_fixed_channels_e5m2/cutlass_tensorop_s16832fprop_fixed_channels_e5m2_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs.dir/generated/conv2d/80/u8_i16832fprop_optimized_u8/cutlass_tensorop_u8_i16832fprop_optimized_u8_256x128_64x3_nc32hw32_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_objs [ 58%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/all_sm89_s16832fprop_optimized_e4m3_conv2d_operations.cu.o [ 58%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/all_sm89_s16832fprop_optimized_e5m2_conv2d_operations.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/cutlass_tensorop_s16832fprop_optimized_e4m3_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/cutlass_tensorop_s16832fprop_optimized_e5m2_256x128_64x3_nhwc_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs.dir/generated/conv2d/89/s16832fprop_optimized_e4m3/cutlass_tensorop_s16832fprop_optimized_e4m3_256x128_64x3_nhwc_single_group_align16.cu.o [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs.dir/generated/conv2d/89/s16832fprop_optimized_e5m2/cutlass_tensorop_s16832fprop_optimized_e5m2_256x128_64x3_nhwc_single_group_align16.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_objs [ 58%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/all_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_conv2d_operations.cu.o [ 58%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_objs [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/all_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, 
cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, 
cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 
1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = 
cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, 
cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, 
cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs.dir/generated/conv2d/90/h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = 
cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, 
cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, 
cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_conv2d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, 
cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default 
constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, 
StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > 
> >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, 
cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, 
cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, 
cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, 
cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/all_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_conv2d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, 
CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, 
cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, 
cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, 
cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/all_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_conv2d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t 
CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ 
/usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs.dir/generated/conv2d/90/s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, 
cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, 
cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has 
no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_analytic_bf16/all_sm80_bf16_s16816dgrad3d_analytic_bf16_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, 
cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_analytic_bf16/cutlass_tensorop_bf16_s16816dgrad3d_analytic_bf16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_optimized_bf16/all_sm80_bf16_s16816dgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816dgrad3d_optimized_bf16/cutlass_tensorop_bf16_s16816dgrad3d_optimized_bf16_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816fprop3d_optimized_bf16/all_sm80_bf16_s16816fprop3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816wgrad3d_optimized_bf16/all_sm80_bf16_s16816wgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816fprop3d_optimized_bf16/cutlass_tensorop_bf16_s16816fprop3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/bf16_s16816wgrad3d_optimized_bf16/cutlass_tensorop_bf16_s16816wgrad3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_analytic_f16/all_sm80_f16_s16816dgrad3d_analytic_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_optimized_f16/all_sm80_f16_s16816dgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_analytic_f16/cutlass_tensorop_f16_s16816dgrad3d_analytic_f16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816dgrad3d_optimized_f16/cutlass_tensorop_f16_s16816dgrad3d_optimized_f16_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816fprop3d_optimized_f16/all_sm80_f16_s16816fprop3d_optimized_f16_conv3d_operations.cu.o [ 59%] Built target 
cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816wgrad3d_optimized_f16/all_sm80_f16_s16816wgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816fprop3d_optimized_f16/cutlass_tensorop_f16_s16816fprop3d_optimized_f16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/f16_s16816wgrad3d_optimized_f16/cutlass_tensorop_f16_s16816wgrad3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs.dir/generated/conv3d/80/h16816dgrad3d_analytic/all_sm80_h16816dgrad3d_analytic_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs.dir/generated/conv3d/80/h16816dgrad3d_optimized/all_sm80_h16816dgrad3d_optimized_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs.dir/generated/conv3d/80/h16816dgrad3d_analytic/cutlass_tensorop_h16816dgrad3d_analytic_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs.dir/generated/conv3d/80/h16816dgrad3d_optimized/cutlass_tensorop_h16816dgrad3d_optimized_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_objs [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs.dir/generated/conv3d/80/h16816fprop3d_optimized/all_sm80_h16816fprop3d_optimized_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs.dir/generated/conv3d/80/h16816wgrad3d_optimized/all_sm80_h16816wgrad3d_optimized_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs.dir/generated/conv3d/80/h16816fprop3d_optimized/cutlass_tensorop_h16816fprop3d_optimized_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs.dir/generated/conv3d/80/h16816wgrad3d_optimized/cutlass_tensorop_h16816wgrad3d_optimized_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_bf16/all_sm80_s16816dgrad3d_analytic_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_f16/all_sm80_s16816dgrad3d_analytic_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_bf16/cutlass_tensorop_s16816dgrad3d_analytic_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_analytic_f16/cutlass_tensorop_s16816dgrad3d_analytic_f16_256x128_32x3.cu.o [ 59%] Built target 
cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_bf16/all_sm80_s16816dgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_f16/all_sm80_s16816dgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_bf16/cutlass_tensorop_s16816dgrad3d_optimized_bf16_256x128_32x3_unity_stride.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816dgrad3d_optimized_f16/cutlass_tensorop_s16816dgrad3d_optimized_f16_256x128_32x3_unity_stride.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_bf16/all_sm80_s16816fprop3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_f16/all_sm80_s16816fprop3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_bf16/cutlass_tensorop_s16816fprop3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs.dir/generated/conv3d/80/s16816fprop3d_optimized_f16/cutlass_tensorop_s16816fprop3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_bf16/all_sm80_s16816wgrad3d_optimized_bf16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_f16/all_sm80_s16816wgrad3d_optimized_f16_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_bf16/cutlass_tensorop_s16816wgrad3d_optimized_bf16_256x128_32x3.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs.dir/generated/conv3d/80/s16816wgrad3d_optimized_f16/cutlass_tensorop_s16816wgrad3d_optimized_f16_256x128_32x3.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/all_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_conv3d_operations.cu.o [ 59%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/all_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_conv3d_operations.cu.o [ 
59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = 
cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > 
>, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, 
cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, 
cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, 
cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; 
cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, 
cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs.dir/generated/conv3d/90/h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16/cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, 
cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, 
cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, 
cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> 
>, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, 
cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, 
CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = cutlass::half_t; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = cutlass::half_t; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::SM75_U32x4_LDSM_N; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::SM90_U32x4_STSM_N]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, 
cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<13> >, 
cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::half_t, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM75_U32x4_LDSM_N, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::SM90_U32x4_STSM_N>, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, 
cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, cutlass::half_t>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/all_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = 
cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, 
cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = 
CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, 
cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; 
FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > 
>, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, 
cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 
2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; 
int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > 
>, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, 
cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, 
cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, 
cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/all_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_conv3d_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, 
CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, 
cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, 
cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<64>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, 
cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/all_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_conv3d_operations.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, 
cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, 
cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, 
cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, 
cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, 
cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool 
ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = 
cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, 
cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_2x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, 
cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, 
cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs.dir/generated/conv3d/90/s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32/cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, 
EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, 
cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, 
cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::bfloat16_t, cutlass::bfloat16_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, 
cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, 
cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/all_sm80_c1688herk_rank_k_operations.cu.o /builddir/build/BUILD/cutlass/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp: In instantiation of 'static constexpr cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Params 
cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::to_underlying_arguments(const ProblemShape&, const cutlass::epilogue::collective::CollectiveEpilogue, CtaTileMNK_, EpilogueTile_, ElementC_, StrideC_, ElementD_, StrideD_, FusionCallbacks_, CopyOpG2S_, SmemLayoutAtomC_, CopyOpS2R_, CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_>::Arguments&, void*) [with ProblemShape = cute::tuple, int, cute::tuple, cute::C<1> >; int StagesC_ = 3; int StagesD_ = 2; int FragmentSize_ = 16; bool ReuseSmemC_ = true; bool DelayTmaStore_ = false; CtaTileMNK_ = cute::tuple, cute::C<64>, cute::tuple > >; EpilogueTile_ = cute::tuple, cute::C<32> >; ElementC_ = float; StrideC_ = cute::tuple, cute::C<1>, cute::C<0> >; ElementD_ = float; StrideD_ = cute::tuple, cute::C<1>, cute::C<0> >; FusionCallbacks_ = cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >; CopyOpG2S_ = cute::SM90_TMA_LOAD_IM2COL; SmemLayoutAtomC_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpS2R_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>; CopyOpS2G_ = cute::SM90_TMA_STORE_IM2COL; SmemLayoutAtomD_ = cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >; CopyOpR2S_ = cute::AutoVectorizingCopyWithAssumedAlignment<128>]': /builddir/build/BUILD/cutlass/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp:149:69: required from 'static cutlass::conv::kernel::ConvUniversal, void>::type>::Params cutlass::conv::kernel::ConvUniversal, void>::type>::to_underlying_arguments(const cutlass::conv::kernel::ConvUniversal, void>::type>::Arguments&, void*) [with CollectiveMainloop_ = 
cutlass::conv::collective::CollectiveConv, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >; CollectiveEpilogue_ = cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >; TileSchedulerTag = void]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:257:48: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::initialize(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, 
cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/include/cutlass/conv/device/conv_universal_adapter.hpp:384:17: required from 'cutlass::Status cutlass::conv::device::ConvUniversalAdapter::run(const Arguments&, void*, cudaStream_t, cutlass::CudaHostAdapter*) [with ConvKernel_ = cutlass3x_sm90_tensorop_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_64x64x64_1x1x1_0_align16_warpspecialized_epi_tma; cutlass::conv::device::ConvUniversalAdapter::Arguments = cutlass::conv::kernel::ConvUniversal, cute::C<1>, cute::C<1> >, cutlass::conv::KernelImplicitTmaWarpSpecializedSm90, 1>, cute::tuple, cute::C<64>, cute::tuple > >, 
cutlass::half_t, cutlass::half_t, cute::TiledMMA >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple >, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void>, cutlass::conv::collective::detail::Sm90ImplicitGemmTileTraits, cute::smem_ptr_flag_bits<16>, cute::Layout, cute::C<64>, cute::C<12> >, cute::tuple, cute::C<1>, cute::C<4096> > > >, void> >, cutlass::epilogue::collective::CollectiveEpilogue, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> >, float, cute::tuple, cute::C<1>, cute::C<0> >, float, cute::tuple, cute::C<1>, cute::C<0> >, cutlass::epilogue::fusion::FusionCallbacks, cutlass::epilogue::fusion::LinearCombination, cute::tuple, cute::C<64>, cute::tuple > >, cute::tuple, cute::C<32> > >, cute::SM90_TMA_LOAD_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128>, cute::SM90_TMA_STORE_IM2COL, cute::ComposedLayout, cute::smem_ptr_flag_bits<32>, cute::Layout, cute::C<32> >, cute::tuple, cute::C<1> > > >, cute::AutoVectorizingCopyWithAssumedAlignment<128> >, void>::Arguments; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:361:50: required from 'cutlass::Status cutlass::library::ConvOperation3x::run(const void*, void*, void*, cudaStream_t) const [with Operator_ = cutlass::conv::device::ConvUniversalAdapter; cudaStream_t = CUstream_st*]' /builddir/build/BUILD/cutlass/tools/library/src/conv_operation_3x.hpp:331:16: required from here /builddir/build/BUILD/cutlass/include/cute/atom/copy_atom.hpp:141:8: note: 'using TMA_D = struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, 
cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >' {aka 'struct cute::TiledCopy, cute::Tensor, cute::tuple, cute::C<0>, cute::C<0> >, cute::C<0>, cute::tuple, cute::C<0>, cute::C<0> > > > >, cute::ComposedLayout, cute::tuple, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 2> >, cute::tuple, 0>, cute::ScaledBasis, 0>, 3>, cute::ScaledBasis, 1>, 3>, cute::ScaledBasis, 2>, 3> > > >, cute::ArithmeticTuple, cute::ArithmeticTuple, cute::C<0>, cute::C<0>, cute::C<0> > >, cute::Layout, cute::C<1>, cute::C<1> > >, cute::tuple, 0>, cute::tuple, 0>, 1>, cute::ScaledBasis, 1>, 1>, cute::ScaledBasis, 2>, 1>, cute::ScaledBasis, 3>, 1> > > > > > >, float>, cute::Layout, cute::tuple, cute::C<64> > > >, cute::tuple, cute::tuple, cute::C<1> > > > >, cute::tuple, cute::C<32> > >'} has no user-provided default constructor struct TiledCopy : Copy_Atom ^~~~~~~~~ /usr/local/cuda-12.4/include/cuda.h:3291:10: note: and the implicitly-defined constructor does not initialize 'cuuint64_t CUtensorMap_st::opaque [16]' cuuint64_t opaque[CU_TENSOR_MAP_NUM_QWORDS]; ^ [ 59%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_objs [ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/all_sm80_c1688syrk_rank_k_operations.cu.o [ 59%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_n_l_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_n_l_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_n_u_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_n_u_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_h_l_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_t_l_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688herk_objs.dir/generated/rank_k/80/c1688herk/cutlass_tensorop_c1688herk_128x64_16x4_h_u_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688syrk_objs.dir/generated/rank_k/80/c1688syrk/cutlass_tensorop_c1688syrk_128x64_16x4_t_u_align1.cu.o
[ 59%] Built target cutlass_library_rank_k_sm80_c1688herk_objs
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/all_sm80_c1688tf32herk_rank_k_operations.cu.o
[ 59%] Built target cutlass_library_rank_k_sm80_c1688syrk_objs
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/all_sm80_c1688tf32syrk_rank_k_operations.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_n_l_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_n_l_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_n_u_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_n_u_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_t_l_align1.cu.o
[ 59%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_h_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32syrk_objs.dir/generated/rank_k/80/c1688tf32syrk/cutlass_tensorop_c1688tf32syrk_128x64_16x4_t_u_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_c1688tf32herk_objs.dir/generated/rank_k/80/c1688tf32herk/cutlass_tensorop_c1688tf32herk_128x64_16x4_h_u_align1.cu.o
[ 60%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk_objs
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/all_sm80_d884syrk_rank_k_operations.cu.o
[ 60%] Built target cutlass_library_rank_k_sm80_c1688tf32herk_objs
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/all_sm80_gz884herk_rank_k_operations.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_n_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_n_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_n_u_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_n_u_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_h_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_t_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884herk_objs.dir/generated/rank_k/80/gz884herk/cutlass_tensorop_gz884herk_64x64_8x3_h_u_align1.cu.o
[ 60%] Built target cutlass_library_rank_k_sm80_gz884herk_objs
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/all_sm80_gz884syrk_rank_k_operations.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_d884syrk_objs.dir/generated/rank_k/80/d884syrk/cutlass_tensorop_d884syrk_128x128_16x3_t_u_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_n_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_n_u_align1.cu.o
[ 60%] Built target cutlass_library_rank_k_sm80_d884syrk_objs
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/all_sm80_s1688syrk_rank_k_operations.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_n_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_t_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_gz884syrk_objs.dir/generated/rank_k/80/gz884syrk/cutlass_tensorop_gz884syrk_64x64_8x3_t_u_align1.cu.o
[ 60%] Built target cutlass_library_rank_k_sm80_gz884syrk_objs
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/all_sm80_s1688tf32syrk_rank_k_operations.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_n_u_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_n_l_align1.cu.o
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_t_l_align1.cu.o
[ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688syrk_objs.dir/generated/rank_k/80/s1688syrk/cutlass_tensorop_s1688syrk_256x128_16x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_s1688tf32syrk_objs.dir/generated/rank_k/80/s1688tf32syrk/cutlass_tensorop_s1688tf32syrk_256x128_16x3_t_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_s1688syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/all_sm80_z884herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_n_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/all_sm80_z884syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_h_l_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884herk_objs.dir/generated/rank_k/80/z884herk/cutlass_tensorop_z884herk_128x64_8x3_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_t_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_z884herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/all_sm90_d1684syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm80_z884syrk_objs.dir/generated/rank_k/80/z884syrk/cutlass_tensorop_z884syrk_128x64_8x3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm80_z884syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/all_sm90_gz1684herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_d1684syrk_objs.dir/generated/rank_k/90/d1684syrk/cutlass_tensorop_d1684syrk_128x128x16_1x1x1_3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684herk_objs.dir/generated/rank_k/90/gz1684herk/cutlass_tensorop_gz1684herk_64x64x8_1x1x1_3_h_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_d1684syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/all_sm90_gz1684syrk_rank_k_operations.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_gz1684herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/all_sm90_z1684herk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA 
object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_gz1684syrk_objs.dir/generated/rank_k/90/gz1684syrk/cutlass_tensorop_gz1684syrk_64x64x8_1x1x1_3_t_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684herk_objs.dir/generated/rank_k/90/z1684herk/cutlass_tensorop_z1684herk_128x64x8_1x1x1_3_h_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_gz1684syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/all_sm90_z1684syrk_rank_k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_z1684herk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/all_sm80_c1688her2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_t_l_align1.cu.o [ 60%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_k_sm90_z1684syrk_objs.dir/generated/rank_k/90/z1684syrk/cutlass_tensorop_z1684syrk_128x64x8_1x1x1_3_t_u_align1.cu.o [ 60%] Built target cutlass_library_rank_k_sm90_z1684syrk_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/all_sm80_c1688syr2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688her2k_objs.dir/generated/rank_2k/80/c1688her2k/cutlass_tensorop_c1688her2k_128x64_16x4_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_t_l_align1.cu.o [ 60%] Built target cutlass_library_rank_2k_sm80_c1688her2k_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/all_sm80_c1688tf32her2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688syr2k_objs.dir/generated/rank_2k/80/c1688syr2k/cutlass_tensorop_c1688syr2k_128x64_16x4_t_u_align1.cu.o 
[ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_n_l_align1.cu.o [ 60%] Built target cutlass_library_rank_2k_sm80_c1688syr2k_objs [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/all_sm80_c1688tf32syr2k_rank_2k_operations.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_n_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_n_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_h_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_t_l_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32her2k_objs.dir/generated/rank_2k/80/c1688tf32her2k/cutlass_tensorop_c1688tf32her2k_128x64_16x4_h_u_align1.cu.o [ 60%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs.dir/generated/rank_2k/80/c1688tf32syr2k/cutlass_tensorop_c1688tf32syr2k_128x64_16x4_t_u_align1.cu.o [ 60%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k_objs [ 61%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/all_sm80_d884syr2k_rank_2k_operations.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/all_sm80_gz884her2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_h_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_t_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884her2k_objs.dir/generated/rank_2k/80/gz884her2k/cutlass_tensorop_gz884her2k_64x64_8x3_h_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_d884syr2k_objs.dir/generated/rank_2k/80/d884syr2k/cutlass_tensorop_d884syr2k_128x128_16x3_t_u_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_gz884her2k_objs [ 61%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/all_sm80_gz884syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_n_l_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_d884syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/all_sm80_s1688syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_n_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_t_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_gz884syr2k_objs.dir/generated/rank_2k/80/gz884syr2k/cutlass_tensorop_gz884syr2k_64x64_8x3_t_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_n_u_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_gz884syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/all_sm80_s1688tf32syr2k_rank_2k_operations.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_n_l_align1.cu.o [ 61%] Building 
CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_t_l_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_n_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688syr2k_objs.dir/generated/rank_2k/80/s1688syr2k/cutlass_tensorop_s1688syr2k_256x128_16x3_t_u_align1.cu.o [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_t_l_align1.cu.o [ 61%] Built target cutlass_library_rank_2k_sm80_s1688syr2k_objs [ 61%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/all_sm80_z884her2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs.dir/generated/rank_2k/80/s1688tf32syr2k/cutlass_tensorop_s1688tf32syr2k_256x128_16x3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_n_u_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/all_sm80_z884syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_h_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884her2k_objs.dir/generated/rank_2k/80/z884her2k/cutlass_tensorop_z884her2k_128x64_8x3_h_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_t_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_z884her2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/all_sm90_d1684syr2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm80_z884syr2k_objs.dir/generated/rank_2k/80/z884syr2k/cutlass_tensorop_z884syr2k_128x64_8x3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_n_l_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm80_z884syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/all_sm90_gz1684her2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_h_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_d1684syr2k_objs.dir/generated/rank_2k/90/d1684syr2k/cutlass_tensorop_d1684syr2k_128x128x16_1x1x1_3_t_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684her2k_objs.dir/generated/rank_2k/90/gz1684her2k/cutlass_tensorop_gz1684her2k_64x64x8_1x1x1_3_h_u_align1.cu.o [ 62%] Built target cutlass_library_rank_2k_sm90_d1684syr2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/all_sm90_gz1684syr2k_rank_2k_operations.cu.o [ 62%] Built target cutlass_library_rank_2k_sm90_gz1684her2k_objs [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/all_sm90_z1684her2k_rank_2k_operations.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_n_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_n_u_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_t_l_align1.cu.o [ 62%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_gz1684syr2k_objs.dir/generated/rank_2k/90/gz1684syr2k/cutlass_tensorop_gz1684syr2k_64x64x8_1x1x1_3_t_u_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_h_l_align1.cu.o [ 63%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k_objs [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/all_sm90_z1684syr2k_rank_2k_operations.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_n_l_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684her2k_objs.dir/generated/rank_2k/90/z1684her2k/cutlass_tensorop_z1684her2k_128x64x8_1x1x1_3_h_u_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_n_u_align1.cu.o [ 63%] Built target 
cutlass_library_rank_2k_sm90_z1684her2k_objs [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/all_sm80_c1688tf32trmm_trmm_operations.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_t_l_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_rank_2k_sm90_z1684syr2k_objs.dir/generated/rank_2k/90/z1684syr2k/cutlass_tensorop_z1684syr2k_128x64x8_1x1x1_3_t_u_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_l_un_align1.cu.o [ 63%] Built target cutlass_library_rank_2k_sm90_z1684syr2k_objs [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/all_sm80_c1688trmm_trmm_operations.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_u_nu_align1.cu.o [ 63%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_ls_u_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_l_un_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_u_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_l_nu_align1.cu.o [ 63%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_nn_rs_u_un_align1.cu.o [ 63%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_cn_rs_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_l_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_nn_rs_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_l_un_align1.cu.o [ 64%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_cn_rs_u_un_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_u_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_l_nu_align1.cu.o [ 64%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_l_un_align1.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_tn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688tf32trmm_objs.dir/generated/trmm/80/c1688tf32trmm/cutlass_tensorop_c1688tf32trmm_128x64_16x4_hn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_l_nu_align1.cu.o [ 65%] Built target cutlass_library_trmm_sm80_c1688tf32trmm_objs [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/all_sm80_d884trmm_trmm_operations.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_u_nu_align1.cu.o [ 65%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_ls_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_u_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_l_nu_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_tn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_l_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_c1688trmm_objs.dir/generated/trmm/80/c1688trmm/cutlass_tensorop_c1688trmm_128x64_16x4_hn_rs_u_un_align1.cu.o [ 65%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_u_nu_align1.cu.o [ 65%] Built target cutlass_library_trmm_sm80_c1688trmm_objs [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/all_sm80_gz884trmm_trmm_operations.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_nn_rs_u_un_align1.cu.o [ 66%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_ls_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_l_nu_align1.cu.o [ 66%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_ls_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_ls_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_l_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_u_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_l_nu_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_d884trmm_objs.dir/generated/trmm/80/d884trmm/cutlass_tensorop_d884trmm_128x128_16x3_tn_rs_u_un_align1.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_l_un_align1.cu.o [ 66%] Built target cutlass_library_trmm_sm80_d884trmm_objs [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/all_sm80_s1688tf32trmm_trmm_operations.cu.o [ 66%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_l_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_nn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_cn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_nn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688tf32trmm_objs.dir/generated/trmm/80/s1688tf32trmm/cutlass_tensorop_s1688tf32trmm_256x128_16x3_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_gz884trmm_objs.dir/generated/trmm/80/gz884trmm/cutlass_tensorop_gz884trmm_64x64_8x3_hn_rs_u_un_align1.cu.o [ 67%] Built target cutlass_library_trmm_sm80_gz884trmm_objs [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/all_sm80_s1688trmm_trmm_operations.cu.o [ 67%] Built target cutlass_library_trmm_sm80_s1688tf32trmm_objs [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/all_sm80_z884trmm_trmm_operations.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_ls_u_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_nn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_l_un_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_nn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_ls_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_cn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_l_nu_align1.cu.o [ 67%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_u_nu_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_l_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_s1688trmm_objs.dir/generated/trmm/80/s1688trmm/cutlass_tensorop_s1688trmm_256x128_16x3_tn_rs_u_un_align1.cu.o [ 67%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_u_nu_align1.cu.o [ 67%] Built target cutlass_library_trmm_sm80_s1688trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/all_sm90_d1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_tn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm80_z884trmm_objs.dir/generated/trmm/80/z884trmm/cutlass_tensorop_z884trmm_128x64_8x3_hn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 68%] Built target cutlass_library_trmm_sm80_z884trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/all_sm90_gz1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_d1684trmm_objs.dir/generated/trmm/90/d1684trmm/cutlass_tensorop_d1684trmm_128x128x16_1x1x1_3_tn_rs_u_un_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_ls_u_un_align1.cu.o [ 68%] Built target cutlass_library_trmm_sm90_d1684trmm_objs [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/all_sm90_z1684trmm_trmm_operations.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_l_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_l_un_align1.cu.o [ 68%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_l_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_u_nu_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_nn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_ls_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_cn_rs_u_un_align1.cu.o [ 68%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_l_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_u_nu_align1.cu.o [ 69%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_u_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_nn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_cn_rs_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_ls_u_un_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_l_nu_align1.cu.o [ 69%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_l_un_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_ls_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_tn_rs_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_gz1684trmm_objs.dir/generated/trmm/90/gz1684trmm/cutlass_tensorop_gz1684trmm_64x64x8_1x1x1_3_hn_rs_u_un_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_ls_u_un_align1.cu.o [ 70%] Built target cutlass_library_trmm_sm90_gz1684trmm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/all_sm80_c1688hemm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_l_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_l_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_u_nu_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_u_nu_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688hemm_objs.dir/generated/symm/80/c1688hemm/cutlass_tensorop_c1688hemm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_tn_rs_u_un_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_trmm_sm90_z1684trmm_objs.dir/generated/trmm/90/z1684trmm/cutlass_tensorop_z1684trmm_128x64x8_1x1x1_3_hn_rs_u_un_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688hemm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/all_sm80_c1688symm_symm_operations.cu.o [ 70%] Built target cutlass_library_trmm_sm90_z1684trmm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/all_sm80_c1688tf32hemm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32hemm_objs.dir/generated/symm/80/c1688tf32hemm/cutlass_tensorop_c1688tf32hemm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688symm_objs.dir/generated/symm/80/c1688symm/cutlass_tensorop_c1688symm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688tf32hemm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/all_sm80_c1688tf32symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_ls_l_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688symm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/all_sm80_d884symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_ls_u_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_c1688tf32symm_objs.dir/generated/symm/80/c1688tf32symm/cutlass_tensorop_c1688tf32symm_128x64_16x4_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_d884symm_objs.dir/generated/symm/80/d884symm/cutlass_tensorop_d884symm_128x128_16x3_n_rs_u_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_c1688tf32symm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/all_sm80_gz884hemm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_ls_l_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_d884symm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/all_sm80_gz884symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_ls_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_ls_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_ls_u_align1.cu.o [ 70%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_rs_l_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884hemm_objs.dir/generated/symm/80/gz884hemm/cutlass_tensorop_gz884hemm_64x64_8x3_n_rs_u_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_gz884hemm_objs [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/all_sm80_s1688symm_symm_operations.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_gz884symm_objs.dir/generated/symm/80/gz884symm/cutlass_tensorop_gz884symm_64x64_8x3_n_rs_u_align1.cu.o [ 70%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_ls_l_align1.cu.o [ 70%] Built target cutlass_library_symm_sm80_gz884symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/all_sm80_s1688tf32symm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688symm_objs.dir/generated/symm/80/s1688symm/cutlass_tensorop_s1688symm_256x128_16x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_s1688tf32symm_objs.dir/generated/symm/80/s1688tf32symm/cutlass_tensorop_s1688tf32symm_256x128_16x3_n_rs_u_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_s1688symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/all_sm80_z884hemm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_ls_l_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_s1688tf32symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/all_sm80_z884symm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_ls_u_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884hemm_objs.dir/generated/symm/80/z884hemm/cutlass_tensorop_z884hemm_128x64_8x3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm80_z884symm_objs.dir/generated/symm/80/z884symm/cutlass_tensorop_z884symm_128x64_8x3_n_rs_u_align1.cu.o [ 71%] Built target cutlass_library_symm_sm80_z884hemm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/all_sm90_d1684symm_symm_operations.cu.o [ 71%] Built target cutlass_library_symm_sm80_z884symm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/all_sm90_gz1684hemm_symm_operations.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_ls_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 71%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_rs_l_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684hemm_objs.dir/generated/symm/90/gz1684hemm/cutlass_tensorop_gz1684hemm_64x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_d1684symm_objs.dir/generated/symm/90/d1684symm/cutlass_tensorop_d1684symm_128x128x16_1x1x1_3_n_rs_u_align1.cu.o [ 71%] Built target cutlass_library_symm_sm90_gz1684hemm_objs [ 71%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/all_sm90_gz1684symm_symm_operations.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 72%] Built target cutlass_library_symm_sm90_d1684symm_objs [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/all_sm90_z1684hemm_symm_operations.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_ls_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_ls_u_align1.cu.o [ 72%] Building CUDA object 
tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_rs_l_align1.cu.o [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_gz1684symm_objs.dir/generated/symm/90/gz1684symm/cutlass_tensorop_gz1684symm_64x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Built target cutlass_library_symm_sm90_gz1684symm_objs [ 72%] Linking CUDA static library libcutlass_symm_sm90_z1684symm.a [ 72%] Built target cutlass_library_symm_sm90_z1684symm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_cgemm.a [ 72%] Built target cutlass_library_gemm_sm50_cgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_dgemm.a [ 72%] Built target cutlass_library_gemm_sm50_dgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm50_sgemm.a [ 72%] Built target cutlass_library_gemm_sm50_sgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm60_hgemm.a [ 72%] Built target cutlass_library_gemm_sm60_hgemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm61_igemm_s8.a [ 72%] Built target cutlass_library_gemm_sm61_igemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm61_s8_igemm_s8.a [ 72%] Built target cutlass_library_gemm_sm61_s8_igemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16_static 
[ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm_planar_complex.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_h884gemm_planar_complex_array.a [ 72%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm70_s884gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_h1688gemm.a [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_h1688gemm_planar_complex.a [ 72%] Building CUDA object tools/library/CMakeFiles/cutlass_library_symm_sm90_z1684hemm_objs.dir/generated/symm/90/z1684hemm/cutlass_tensorop_z1684hemm_128x64x8_1x1x1_3_n_rs_u_align1.cu.o [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_static [ 72%] Linking CUDA static library 
libcutlass_gemm_sm75_h1688gemm_planar_complex_array.a [ 72%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i88128xorgemm_b1.a [ 72%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8816gemm_s8.a [ 72%] Built target cutlass_library_gemm_sm75_i8816gemm_s8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8816gemm_u8.a [ 72%] Built target cutlass_library_gemm_sm75_i8816gemm_u8_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8832gemm_s4.a [ 72%] Built target cutlass_library_gemm_sm75_i8832gemm_s4_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_i8832gemm_u4.a [ 72%] Built target cutlass_library_gemm_sm75_i8832gemm_u4_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_f16.a [ 72%] Built target cutlass_library_gemm_sm75_s1688gemm_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.a [ 72%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16_static [ 72%] Linking CUDA static library libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.a [ 72%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_s4_i8832gemm_s4.a [ 73%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_s8_i8816gemm_s8.a [ 73%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_u4_i8832gemm_u4.a [ 73%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4_static [ 73%] Linking CUDA static library libcutlass_gemm_sm75_u8_i8816gemm_u8.a [ 73%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16.a [ 73%] Built target 
cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16_static [ 73%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.a [ 73%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_c1688gemm.a [ 74%] Built target cutlass_library_gemm_sm80_c1688gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_c1688tf32gemm.a [ 74%] Built target cutlass_library_gemm_sm80_c1688tf32gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_cgemm.a [ 74%] Built target cutlass_library_gemm_sm80_cgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_d884gemm.a [ 74%] Built target cutlass_library_gemm_sm80_d884gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_dgemm.a [ 74%] Built target cutlass_library_gemm_sm80_dgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16.a [ 74%] Built target 
cutlass_library_gemm_sm80_f16_s16816gemm_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.a [ 74%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_f16_s16832spgemm_f16.a [ 74%] Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.a [ 74%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_gz884gemm.a [ 74%] Built target cutlass_library_gemm_sm80_gz884gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_grouped.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped_static [ 74%] Linking CUDA static library 
libcutlass_gemm_sm80_h16816gemm_planar_complex.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_planar_complex_array.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16816gemm_s8_f16.a [ 74%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_h16832spgemm.a [ 74%] Built target cutlass_library_gemm_sm80_h16832spgemm_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168128spgemm_s4.a [ 74%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168256andgemm_b1.a [ 74%] Built target cutlass_library_gemm_sm80_i168256andgemm_b1_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i168256xorgemm_b1.a [ 74%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16832gemm_s8.a [ 74%] Built target cutlass_library_gemm_sm80_i16832gemm_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16832gemm_u8.a [ 74%] Built target cutlass_library_gemm_sm80_i16832gemm_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864gemm_s4.a [ 74%] Built target cutlass_library_gemm_sm80_i16864gemm_s4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864gemm_u4.a [ 74%] Built target cutlass_library_gemm_sm80_i16864gemm_u4_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_i16864spgemm_s8.a [ 74%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16_s8.a [ 74%] Built target 
cutlass_library_gemm_sm80_s16816gemm_bf16_s8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_bf16_u8.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8_static [ 74%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16.a [ 74%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16_s8.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_f16_u8.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_grouped_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_grouped_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_s8_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_s8_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16_static [ 75%] Linking CUDA static library 
libcutlass_gemm_sm80_s16816gemm_u8_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816gemm_u8_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16816tf32spgemm.a [ 75%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16832spgemm_bf16.a [ 75%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s16832spgemm_f16.a [ 75%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688bf16gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s1688bf16gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688f16gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s1688f16gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s1688gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688gemm_tf32.a [ 75%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s1688tf32gemm.a [ 75%] Built target cutlass_library_gemm_sm80_s1688tf32gemm_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s4_i168128spgemm_s4.a [ 75%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4_static [ 75%] Linking CUDA static library libcutlass_gemm_sm80_s4_i16864gemm_s4.a [ 75%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_s8_i16832gemm_s8.a [ 76%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_s8_i16864spgemm_s8.a [ 76%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8_static [ 76%] Linking CUDA 
static library libcutlass_gemm_sm80_sgemm.a [ 76%] Built target cutlass_library_gemm_sm80_sgemm_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_tf32_s1688gemm_tf32.a [ 76%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_u4_i16864gemm_u4.a [ 76%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_u8_i16832gemm_u8.a [ 76%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8_static [ 76%] Linking CUDA static library libcutlass_gemm_sm80_z884gemm.a [ 76%] Built target cutlass_library_gemm_sm80_z884gemm_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16fprop_optimized.a [ 76%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3_static [ 76%] Linking CUDA static library 
libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.a [ 76%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e4m3.a [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2_static [ 76%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e5m2.a [ 76%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.a [ 77%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.a [ 77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.a [ 
77%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_d1684gemm.a [ 77%] Built target cutlass_library_gemm_sm90_d1684gemm_static [ 77%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.a [ 77%] Built target cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_static [ 78%] Linking CUDA static library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.a [ 78%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_gz1684gemm.a [ 79%] Built target cutlass_library_gemm_sm90_gz1684gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_h64x128x16gemm.a [ 79%] Built target cutlass_library_gemm_sm90_h64x128x16gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_i64x128x32gemm_u8.a [ 79%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_bf16.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x16gemm_f16.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e4m3.a [ 79%] Built target 
cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x8gemm_tf32.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s64x128x8tf32gemm.a [ 79%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.a [ 79%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_i64x128x32gemm_s8.a [ 79%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_i64x128x32gemm_u8.a [ 79%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x16gemm_f16.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_static [ 79%] Linking CUDA static library 
libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.a [ 79%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3_static [ 79%] Linking CUDA static library libcutlass_gemm_sm90_z1684gemm.a [ 79%] Built target cutlass_library_gemm_sm90_z1684gemm_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_sdgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_sfprop_optimized.a [ 79%] Built target cutlass_library_conv2d_sm50_sfprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm50_swgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm50_swgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm60_hfprop_optimized.a [ 79%] Built target cutlass_library_conv2d_sm60_hfprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.a [ 79%] Built target 
cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884dgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884fprop_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_h884wgrad_optimized.a [ 79%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884fprop_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm70_s884wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.a [ 79%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.a [ 79%] Built target 
cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16_static [ 79%] Linking CUDA static library libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.a [ 79%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688dgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_few_channels.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_fixed_channels.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688fprop_optimized.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_h1688wgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8816fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8816fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8832fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_i8832fprop_optimized_u4.a [ 80%] Built target 
cutlass_library_conv2d_sm75_i8832fprop_optimized_u4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688fprop_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.a [ 80%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8_static [ 80%] Linking CUDA static library 
libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816dgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816fprop_fixed_channels.a [ 
80%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816fprop_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_h16816wgrad_optimized.a [ 80%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16832fprop_optimized_s8.a [ 80%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16832fprop_optimized_u8.a [ 80%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16864fprop_optimized_s4.a [ 80%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_i16864fprop_optimized_u4.a [ 80%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.a [ 80%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16_static [ 80%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.a [ 80%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16_static [ 81%] Linking CUDA static library 
libcutlass_conv2d_sm80_s16816fprop_optimized_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.a [ 81%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16fprop_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688f16wgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688fprop_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32_static [ 81%] Linking CUDA static library 
libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688tf32fprop_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688wgrad_optimized.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.a [ 81%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.a [ 81%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.a [ 81%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8_static [ 81%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.a [ 81%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.a [ 82%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_sdgrad_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_sfprop_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_sfprop_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_swgrad_optimized.a [ 82%] Built target cutlass_library_conv2d_sm80_swgrad_optimized_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32_static [ 82%] Linking CUDA static library 
libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.a [ 82%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.a [ 82%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.a [ 82%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.a [ 82%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16_static [ 82%] Linking CUDA static library 
libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.a [ 82%] Built target 
cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816dgrad3d_analytic.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816dgrad3d_optimized.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816fprop3d_optimized.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_h16816wgrad3d_optimized.a [ 82%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16_static [ 82%] Linking CUDA static library 
libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.a [ 82%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.a [ 82%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.a [ 82%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688herk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688syrk_static [ 82%] 
Linking CUDA static library libcutlass_rank_k_sm80_c1688tf32herk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688tf32herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_c1688tf32syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_d884syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_d884syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_gz884herk.a [ 82%] Built target cutlass_library_rank_k_sm80_gz884herk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_gz884syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_gz884syrk_static [ 82%] Linking CUDA static library libcutlass_rank_k_sm80_s1688syrk.a [ 82%] Built target cutlass_library_rank_k_sm80_s1688syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_s1688tf32syrk.a [ 83%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_z884herk.a [ 83%] Built target cutlass_library_rank_k_sm80_z884herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm80_z884syrk.a [ 83%] Built target cutlass_library_rank_k_sm80_z884syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_d1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm90_d1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_gz1684herk.a [ 83%] Built target cutlass_library_rank_k_sm90_gz1684herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_gz1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm90_gz1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_z1684herk.a [ 83%] Built target cutlass_library_rank_k_sm90_z1684herk_static [ 83%] Linking CUDA static library libcutlass_rank_k_sm90_z1684syrk.a [ 83%] Built target cutlass_library_rank_k_sm90_z1684syrk_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688her2k.a [ 83%] Built target 
cutlass_library_rank_2k_sm80_c1688her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688tf32her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_c1688tf32syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_d884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_d884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_gz884her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_gz884her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_gz884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_gz884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_s1688syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_s1688syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_s1688tf32syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_z884her2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_z884her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm80_z884syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm80_z884syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_d1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_d1684syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_gz1684her2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_gz1684her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_gz1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_z1684her2k.a [ 83%] Built target 
cutlass_library_rank_2k_sm90_z1684her2k_static [ 83%] Linking CUDA static library libcutlass_rank_2k_sm90_z1684syr2k.a [ 83%] Built target cutlass_library_rank_2k_sm90_z1684syr2k_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_c1688tf32trmm.a [ 83%] Built target cutlass_library_trmm_sm80_c1688tf32trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_c1688trmm.a [ 83%] Built target cutlass_library_trmm_sm80_c1688trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_d884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_d884trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_gz884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_gz884trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_s1688tf32trmm.a [ 83%] Built target cutlass_library_trmm_sm80_s1688tf32trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_s1688trmm.a [ 83%] Built target cutlass_library_trmm_sm80_s1688trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm80_z884trmm.a [ 83%] Built target cutlass_library_trmm_sm80_z884trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_d1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_d1684trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_gz1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_gz1684trmm_static [ 83%] Linking CUDA static library libcutlass_trmm_sm90_z1684trmm.a [ 83%] Built target cutlass_library_trmm_sm90_z1684trmm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688hemm.a [ 83%] Built target cutlass_library_symm_sm80_c1688hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688symm.a [ 83%] Built target cutlass_library_symm_sm80_c1688symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688tf32hemm.a [ 83%] Built target cutlass_library_symm_sm80_c1688tf32hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_c1688tf32symm.a [ 83%] Built 
target cutlass_library_symm_sm80_c1688tf32symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_d884symm.a [ 83%] Built target cutlass_library_symm_sm80_d884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_gz884hemm.a [ 83%] Built target cutlass_library_symm_sm80_gz884hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_gz884symm.a [ 83%] Built target cutlass_library_symm_sm80_gz884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_s1688symm.a [ 83%] Built target cutlass_library_symm_sm80_s1688symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_s1688tf32symm.a [ 83%] Built target cutlass_library_symm_sm80_s1688tf32symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_z884hemm.a [ 83%] Built target cutlass_library_symm_sm80_z884hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm80_z884symm.a [ 83%] Built target cutlass_library_symm_sm80_z884symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_d1684symm.a [ 83%] Built target cutlass_library_symm_sm90_d1684symm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_gz1684hemm.a [ 83%] Built target cutlass_library_symm_sm90_gz1684hemm_static [ 83%] Linking CUDA static library libcutlass_symm_sm90_gz1684symm.a [ 83%] Built target cutlass_library_symm_sm90_gz1684symm_static [ 83%] Linking CUDA shared library libcutlass_symm_sm90_z1684symm.so [ 83%] Built target cutlass_library_symm_sm90_z1684symm [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_cgemm.so [ 83%] Built target cutlass_library_gemm_sm50_cgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_dgemm.so [ 83%] Built target cutlass_library_gemm_sm50_dgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm50_sgemm.so [ 83%] Built target cutlass_library_gemm_sm50_sgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm60_hgemm.so [ 83%] Built target cutlass_library_gemm_sm60_hgemm [ 83%] Linking CUDA shared library 
libcutlass_gemm_sm61_igemm_s8.so [ 83%] Built target cutlass_library_gemm_sm61_igemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm61_s8_igemm_s8.so [ 83%] Built target cutlass_library_gemm_sm61_s8_igemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm70_f16_s884gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm.so [ 83%] Built target cutlass_library_gemm_sm70_h884gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm_planar_complex.so [ 83%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_h884gemm_planar_complex_array.so [ 83%] Built target cutlass_library_gemm_sm70_h884gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm70_s884gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library 
libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm75_f16_s1688gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm.so [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm_planar_complex.so [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so [ 83%] Built target cutlass_library_gemm_sm75_h1688gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i88128xorgemm_b1.so [ 83%] Built target cutlass_library_gemm_sm75_i88128xorgemm_b1 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8816gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm75_i8816gemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8816gemm_u8.so [ 83%] Built target cutlass_library_gemm_sm75_i8816gemm_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8832gemm_s4.so [ 83%] Built target cutlass_library_gemm_sm75_i8832gemm_s4 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_i8832gemm_u4.so [ 83%] Built target cutlass_library_gemm_sm75_i8832gemm_u4 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm75_s1688gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s4_i8832gemm_s4.so [ 83%] Built target cutlass_library_gemm_sm75_s4_i8832gemm_s4 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_s8_i8816gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm75_s8_i8816gemm_s8 [ 83%] 
Linking CUDA shared library libcutlass_gemm_sm75_u4_i8832gemm_u4.so [ 83%] Built target cutlass_library_gemm_sm75_u4_i8832gemm_u4 [ 83%] Linking CUDA shared library libcutlass_gemm_sm75_u8_i8816gemm_u8.so [ 83%] Built target cutlass_library_gemm_sm75_u8_i8816gemm_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_bf16_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_planar_complex_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_s8_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16816gemm_u8_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so [ 83%] Built target cutlass_library_gemm_sm80_bf16_s16832spgemm_bf16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_c1688gemm.so [ 83%] Built target cutlass_library_gemm_sm80_c1688gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_c1688tf32gemm.so [ 83%] Built target cutlass_library_gemm_sm80_c1688tf32gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_cgemm.so [ 83%] Built target cutlass_library_gemm_sm80_cgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_d884gemm.so [ 83%] Built 
target cutlass_library_gemm_sm80_d884gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_dgemm.so [ 83%] Built target cutlass_library_gemm_sm80_dgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_f16_u8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_array_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_planar_complex_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_s8_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16816gemm_u8_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_f16_s16832spgemm_f16.so [ 83%] Built target cutlass_library_gemm_sm80_f16_s16832spgemm_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_gz884gemm.so [ 83%] Built target cutlass_library_gemm_sm80_gz884gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_grouped.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_grouped [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_planar_complex.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex [ 83%] Linking CUDA shared library 
libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_planar_complex_array [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16816gemm_s8_f16.so [ 83%] Built target cutlass_library_gemm_sm80_h16816gemm_s8_f16 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_h16832spgemm.so [ 83%] Built target cutlass_library_gemm_sm80_h16832spgemm [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168128spgemm_s4.so [ 83%] Built target cutlass_library_gemm_sm80_i168128spgemm_s4 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168256andgemm_b1.so [ 83%] Built target cutlass_library_gemm_sm80_i168256andgemm_b1 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i168256xorgemm_b1.so [ 83%] Built target cutlass_library_gemm_sm80_i168256xorgemm_b1 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i16832gemm_s8.so [ 83%] Built target cutlass_library_gemm_sm80_i16832gemm_s8 [ 83%] Linking CUDA shared library libcutlass_gemm_sm80_i16832gemm_u8.so [ 83%] Built target cutlass_library_gemm_sm80_i16832gemm_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864gemm_s4.so [ 84%] Built target cutlass_library_gemm_sm80_i16864gemm_s4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864gemm_u4.so [ 84%] Built target cutlass_library_gemm_sm80_i16864gemm_u4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_i16864spgemm_s8.so [ 84%] Built target cutlass_library_gemm_sm80_i16864spgemm_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16_s8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_bf16_u8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_bf16_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16.so [ 84%] 
Built target cutlass_library_gemm_sm80_s16816gemm_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16_s8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_f16_u8.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_f16_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_grouped_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_grouped_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_array_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_planar_complex_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_s8_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_s8_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_s8_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_u8_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_bf16 [ 84%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so [ 84%] Built target cutlass_library_conv2d_sm75_cf32_cfprop_optimized_cf32 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816gemm_u8_f16.so [ 
84%] Built target cutlass_library_gemm_sm80_s16816gemm_u8_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16816tf32spgemm.so [ 84%] Built target cutlass_library_gemm_sm80_s16816tf32spgemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16832spgemm_bf16.so [ 84%] Built target cutlass_library_gemm_sm80_s16832spgemm_bf16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s16832spgemm_f16.so [ 84%] Built target cutlass_library_gemm_sm80_s16832spgemm_f16 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688bf16gemm.so [ 84%] Built target cutlass_library_gemm_sm80_s1688bf16gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688f16gemm.so [ 84%] Built target cutlass_library_gemm_sm80_s1688f16gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688gemm.so [ 84%] Built target cutlass_library_gemm_sm80_s1688gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688gemm_tf32.so [ 84%] Built target cutlass_library_gemm_sm80_s1688gemm_tf32 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s1688tf32gemm.so [ 84%] Built target cutlass_library_gemm_sm80_s1688tf32gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s4_i168128spgemm_s4.so [ 84%] Built target cutlass_library_gemm_sm80_s4_i168128spgemm_s4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s4_i16864gemm_s4.so [ 84%] Built target cutlass_library_gemm_sm80_s4_i16864gemm_s4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s8_i16832gemm_s8.so [ 84%] Built target cutlass_library_symm_sm90_z1684hemm_objs [ 84%] Built target cutlass_library_gemm_sm80_s8_i16832gemm_s8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_s8_i16864spgemm_s8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_sgemm.so [ 84%] Built target cutlass_library_gemm_sm80_s8_i16864spgemm_s8 [ 84%] Built target cutlass_library_gemm_sm80_sgemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_u4_i16864gemm_u4.so [ 84%] Linking CUDA shared library 
libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so [ 84%] Built target cutlass_library_gemm_sm80_tf32_s1688gemm_tf32 [ 84%] Built target cutlass_library_gemm_sm80_u4_i16864gemm_u4 [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_u8_i16832gemm_u8.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm80_z884gemm.so [ 84%] Built target cutlass_library_gemm_sm80_u8_i16832gemm_u8 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so [ 84%] Built target cutlass_library_gemm_sm80_z884gemm [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2 [ 84%] Built target cutlass_library_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3 [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e4m3.so [ 84%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so [ 84%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2 [ 85%] Built target cutlass_library_gemm_sm89_s16832gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so [ 85%] Built target 
cutlass_library_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e4m3.so [ 85%] Built target cutlass_library_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3 [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e4m3_e5m2 [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so [ 85%] Built target cutlass_library_gemm_sm89_s16864spgemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x16gemm_bf16 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_d1684gemm.so [ 85%] Built target cutlass_library_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so [ 85%] Built target cutlass_library_gemm_sm90_d1684gemm [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so [ 85%] Built target 
cutlass_library_gemm_sm90_f16_s64x128x16gemm_f16 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2 [ 85%] Built target cutlass_library_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3 [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_gz1684gemm.so [ 85%] Linking CUDA shared library libcutlass_gemm_sm90_h64x128x16gemm.so [ 85%] Built target cutlass_library_gemm_sm90_gz1684gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_i64x128x32gemm_s8.so [ 86%] Built target cutlass_library_gemm_sm90_h64x128x16gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_i64x128x32gemm_u8.so [ 86%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_s8 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x16gemm_bf16.so [ 86%] Built target cutlass_library_gemm_sm90_i64x128x32gemm_u8 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x16gemm_f16.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_bf16 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x16gemm_f16 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e4m3_e5m2 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2 [ 86%] Linking CUDA shared library 
libcutlass_gemm_sm90_s64x128x8gemm_tf32.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x32gemm_e5m2_e4m3 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s64x128x8tf32gemm.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x8gemm_tf32 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so [ 86%] Built target cutlass_library_gemm_sm90_s64x128x8tf32gemm [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so [ 86%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_s8 [ 86%] Built target cutlass_library_gemm_sm90_s8_i64x128x32gemm_u8 [ 86%] Linking CUDA shared library libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so [ 87%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_s8 [ 87%] Built target cutlass_library_gemm_sm90_void_i64x128x32gemm_u8 [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so [ 87%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so [ 87%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_bf16 [ 87%] Built target cutlass_library_gemm_sm90_void_s64x128x16gemm_f16 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3 [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2 [ 88%] Linking CUDA shared library libcutlass_gemm_sm90_z1684gemm.so [ 88%] Built target cutlass_library_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3 [ 88%] Linking CUDA shared library 
libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so [ 88%] Built target cutlass_library_gemm_sm90_z1684gemm [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cdgrad_optimized_cf32 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cfprop_optimized_cf32 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_sdgrad_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_cf32_cwgrad_optimized_cf32 [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_sfprop_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_sdgrad_optimized [ 88%] Linking CUDA shared library libcutlass_conv2d_sm50_swgrad_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_sfprop_optimized [ 88%] Linking CUDA shared library libcutlass_conv2d_sm60_hfprop_optimized.so [ 88%] Built target cutlass_library_conv2d_sm50_swgrad_optimized [ 88%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so [ 88%] Built target cutlass_library_conv2d_sm60_hfprop_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884fprop_optimized_f16 [ 89%] Built target cutlass_library_conv2d_sm70_f16_s884wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884dgrad_optimized.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884fprop_optimized.so [ 89%] Built target cutlass_library_conv2d_sm70_h884dgrad_optimized [ 89%] Built target cutlass_library_conv2d_sm70_h884fprop_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_h884wgrad_optimized.so [ 89%] Linking CUDA shared library 
libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_h884wgrad_optimized [ 89%] Built target cutlass_library_conv2d_sm70_s884dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884fprop_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm70_s884fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so [ 89%] Built target cutlass_library_conv2d_sm70_s884wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so [ 89%] Built target cutlass_library_conv2d_sm75_cf32_cdgrad_optimized_cf32 [ 89%] Built target cutlass_library_conv2d_sm75_cf32_cwgrad_optimized_cf32 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_few_channels_f16 [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_fixed_channels_f16 [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688dgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm75_f16_s1688wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_few_channels.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688dgrad_optimized [ 89%] Linking CUDA shared library 
libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_fixed_channels [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_few_channels [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688fprop_optimized.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_h1688wgrad_optimized.so [ 89%] Built target cutlass_library_conv2d_sm75_h1688fprop_optimized [ 89%] Built target cutlass_library_conv2d_sm75_h1688wgrad_optimized [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_u8 [ 89%] Built target cutlass_library_conv2d_sm75_i8816fprop_optimized_s8 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so [ 89%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_u4 [ 89%] Built target cutlass_library_conv2d_sm75_i8832fprop_optimized_s4 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_few_channels_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688dgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_fixed_channels_f16 [ 89%] Built target cutlass_library_conv2d_sm75_s1688fprop_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so [ 89%] Built target 
cutlass_library_conv2d_sm75_s1688wgrad_optimized_f16 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_s4_i8832fprop_optimized_s4 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_few_channels_s8 [ 89%] Linking CUDA shared library libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so [ 89%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_fixed_channels_s8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so [ 90%] Built target cutlass_library_conv2d_sm75_s8_i8816fprop_optimized_s8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so [ 90%] Built target cutlass_library_conv2d_sm75_u4_i8832fprop_optimized_u4 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_few_channels_u8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_fixed_channels_u8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so [ 90%] Built target cutlass_library_conv2d_sm75_u8_i8816fprop_optimized_u8 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816dgrad_optimized_bf16 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16 [ 90%] Linking CUDA shared library libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so [ 90%] Built target cutlass_library_conv2d_sm80_bf16_s16816fprop_optimized_bf16 [ 91%] Linking CUDA shared library 
libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_bf16_s16816wgrad_optimized_bf16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816dgrad_optimized_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_fixed_channels_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816fprop_optimized_f16 [ 91%] Built target cutlass_library_conv2d_sm80_f16_s16816wgrad_optimized_f16 [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816dgrad_optimized.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so [ 91%] Built target cutlass_library_conv2d_sm80_h16816fprop_fixed_channels [ 91%] Built target cutlass_library_conv2d_sm80_h16816dgrad_optimized [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816wgrad_optimized.so [ 91%] Linking CUDA shared library libcutlass_conv2d_sm80_h16816fprop_optimized.so [ 91%] Built target cutlass_library_conv2d_sm80_h16816wgrad_optimized [ 91%] Built target cutlass_library_conv2d_sm80_h16816fprop_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so [ 92%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_s8 [ 92%] Built target cutlass_library_conv2d_sm80_i16832fprop_optimized_u8 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so [ 92%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_u4 [ 92%] Built target cutlass_library_conv2d_sm80_i16864fprop_optimized_s4 [ 92%] 
Linking CUDA shared library libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816dgrad_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_bf16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_fixed_channels_f16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816fprop_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_bf16 [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s16816wgrad_optimized_f16 [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16dgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16fprop_optimized [ 92%] Built target cutlass_library_conv2d_sm80_s1688bf16wgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688dgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so [ 92%] Built target 
cutlass_library_conv2d_sm80_s1688dgrad_optimized_tf32 [ 92%] Built target cutlass_library_conv2d_sm80_s1688dgrad_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16fprop_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16dgrad_optimized [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16fprop_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688fprop_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688f16wgrad_optimized [ 92%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so [ 92%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so [ 92%] Built target cutlass_library_conv2d_sm80_s1688fprop_optimized_tf32 [ 92%] Built target cutlass_library_conv2d_sm80_s1688tf32dgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_s1688tf32fprop_optimized [ 93%] Built target cutlass_library_conv2d_sm80_s1688tf32wgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688wgrad_optimized.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized [ 93%] Built target cutlass_library_conv2d_sm80_s1688wgrad_optimized_tf32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so [ 93%] Built target cutlass_library_conv2d_sm80_s4_i16864fprop_optimized_s4 [ 93%] Linking CUDA shared library 
libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_few_channels_s8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_fixed_channels_s8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_sdgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_s8_i16832fprop_optimized_s8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_sfprop_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_sdgrad_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_swgrad_optimized.so [ 93%] Built target cutlass_library_conv2d_sm80_sfprop_optimized [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_swgrad_optimized [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688dgrad_optimized_tf32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688fprop_optimized_tf32 [ 93%] Built target cutlass_library_conv2d_sm80_tf32_s1688wgrad_optimized_tf32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so [ 93%] Built target cutlass_library_conv2d_sm80_u4_i16864fprop_optimized_u4 [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_few_channels_u8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_fixed_channels_u8 [ 93%] Linking CUDA shared library 
libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so [ 93%] Built target cutlass_library_conv2d_sm80_u8_i16832fprop_optimized_u8 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e4m3 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_fixed_channels_e5m2 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e4m3 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so [ 93%] Built target cutlass_library_conv2d_sm89_s16832fprop_optimized_e5m2 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so [ 93%] Built target cutlass_library_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so [ 93%] Built target cutlass_library_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16 [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32 [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32 [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32 [ 93%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so [ 93%] Linking CUDA shared library libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so [ 93%] Built target cutlass_library_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32 [ 93%] Linking 
CUDA shared library libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so [ 93%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16 [ 93%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so [ 93%] Built target cutlass_library_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16 [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_analytic_f16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816fprop3d_optimized_f16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816dgrad3d_optimized_f16 [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so [ 94%] Built target cutlass_library_conv3d_sm80_f16_s16816wgrad3d_optimized_f16 [ 94%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_analytic [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816fprop3d_optimized.so [ 94%] Built target cutlass_library_conv3d_sm80_h16816dgrad3d_optimized [ 94%] Built target cutlass_library_conv3d_sm80_h16816fprop3d_optimized [ 94%] Linking CUDA shared library libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so [ 95%] Built target 
cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_bf16 [ 95%] Built target cutlass_library_conv3d_sm80_h16816wgrad3d_optimized [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_analytic_f16 [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_bf16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816dgrad3d_optimized_f16 [ 95%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_bf16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_bf16 [ 95%] Built target cutlass_library_conv3d_sm80_s16816fprop3d_optimized_f16 [ 95%] Linking CUDA shared library libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so [ 95%] Linking CUDA shared library libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so [ 95%] Built target cutlass_library_conv3d_sm80_s16816wgrad3d_optimized_f16 [ 95%] Built target cutlass_library_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16 [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so [ 
96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32 [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32 [ 96%] Linking CUDA shared library libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688herk.so [ 96%] Built target cutlass_library_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32 [ 96%] Built target cutlass_library_rank_k_sm80_c1688herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688tf32herk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_c1688tf32herk [ 96%] Built target cutlass_library_rank_k_sm80_c1688syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_c1688tf32syrk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_d884syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_c1688tf32syrk [ 96%] Built target cutlass_library_rank_k_sm80_d884syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_gz884herk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_gz884syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_gz884herk [ 96%] Built target cutlass_library_rank_k_sm80_gz884syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_s1688syrk.so [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_s1688tf32syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_s1688syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_z884herk.so [ 96%] Built target cutlass_library_rank_k_sm80_s1688tf32syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm80_z884syrk.so [ 96%] Built target cutlass_library_rank_k_sm80_z884herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_d1684syrk.so [ 96%] Built target 
cutlass_library_rank_k_sm80_z884syrk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_gz1684herk.so [ 96%] Built target cutlass_library_rank_k_sm90_d1684syrk [ 96%] Built target cutlass_library_rank_k_sm90_gz1684herk [ 96%] Linking CUDA shared library libcutlass_rank_k_sm90_gz1684syrk.so [ 97%] Linking CUDA shared library libcutlass_rank_k_sm90_z1684herk.so [ 97%] Built target cutlass_library_rank_k_sm90_gz1684syrk [ 97%] Built target cutlass_library_rank_k_sm90_z1684herk [ 97%] Linking CUDA shared library libcutlass_rank_k_sm90_z1684syrk.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688her2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688her2k [ 97%] Built target cutlass_library_rank_k_sm90_z1684syrk [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688tf32her2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_c1688tf32her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_c1688tf32syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_d884syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_c1688tf32syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_d884syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_gz884her2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_gz884syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_gz884syr2k [ 97%] Built target cutlass_library_rank_2k_sm80_gz884her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_s1688syr2k.so [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_s1688tf32syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_s1688syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_z884her2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_s1688tf32syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm80_z884syr2k.so [ 97%] Built target 
cutlass_library_rank_2k_sm80_z884her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_d1684syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm80_z884syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_gz1684her2k.so [ 97%] Built target cutlass_library_rank_2k_sm90_d1684syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_gz1684syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm90_gz1684her2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_z1684her2k.so [ 97%] Built target cutlass_library_rank_2k_sm90_gz1684syr2k [ 97%] Linking CUDA shared library libcutlass_rank_2k_sm90_z1684syr2k.so [ 97%] Built target cutlass_library_rank_2k_sm90_z1684her2k [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_c1688tf32trmm.so [ 97%] Built target cutlass_library_rank_2k_sm90_z1684syr2k [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_c1688trmm.so [ 97%] Built target cutlass_library_trmm_sm80_c1688tf32trmm [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_d884trmm.so [ 97%] Built target cutlass_library_trmm_sm80_c1688trmm [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_gz884trmm.so [ 97%] Built target cutlass_library_trmm_sm80_d884trmm [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_s1688tf32trmm.so [ 97%] Built target cutlass_library_trmm_sm80_gz884trmm [ 97%] Linking CUDA shared library libcutlass_trmm_sm80_s1688trmm.so [ 97%] Built target cutlass_library_trmm_sm80_s1688tf32trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm80_z884trmm.so [ 98%] Built target cutlass_library_trmm_sm80_s1688trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_d1684trmm.so [ 98%] Built target cutlass_library_trmm_sm80_z884trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_gz1684trmm.so [ 98%] Built target cutlass_library_trmm_sm90_d1684trmm [ 98%] Built target cutlass_library_trmm_sm90_gz1684trmm [ 98%] Linking CUDA shared library libcutlass_trmm_sm90_z1684trmm.so [ 98%] Linking CUDA 
shared library libcutlass_symm_sm80_c1688hemm.so [ 98%] Built target cutlass_library_trmm_sm90_z1684trmm [ 98%] Built target cutlass_library_symm_sm80_c1688hemm [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688symm.so [ 98%] Linking CUDA shared library libcutlass_symm_sm80_c1688tf32hemm.so [ 98%] Built target cutlass_library_symm_sm80_c1688tf32hemm [ 98%] Built target cutlass_library_symm_sm80_c1688symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_c1688tf32symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_d884symm.so [ 99%] Built target cutlass_library_symm_sm80_d884symm [ 99%] Built target cutlass_library_symm_sm80_c1688tf32symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_gz884symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_gz884hemm.so [ 99%] Built target cutlass_library_symm_sm80_gz884hemm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_s1688symm.so [ 99%] Built target cutlass_library_symm_sm80_gz884symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_s1688tf32symm.so [ 99%] Built target cutlass_library_symm_sm80_s1688tf32symm [ 99%] Built target cutlass_library_symm_sm80_s1688symm [ 99%] Linking CUDA shared library libcutlass_symm_sm80_z884symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm80_z884hemm.so [ 99%] Built target cutlass_library_symm_sm80_z884hemm [ 99%] Built target cutlass_library_symm_sm80_z884symm [ 99%] Linking CUDA shared library libcutlass_symm_sm90_gz1684hemm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm90_d1684symm.so [ 99%] Built target cutlass_library_symm_sm90_d1684symm [ 99%] Built target cutlass_library_symm_sm90_gz1684hemm [ 99%] Linking CUDA shared library libcutlass_symm_sm90_gz1684symm.so [ 99%] Linking CUDA shared library libcutlass_symm_sm90_z1684hemm.so [ 99%] Built target cutlass_library_symm_sm90_gz1684symm [ 99%] Linking CUDA static library libcutlass_symm_sm90_z1684hemm.a [ 99%] Built target 
cutlass_library_symm_sm90_z1684hemm [ 99%] Linking CXX shared library libcutlass.so [ 99%] Built target cutlass_library_symm_sm90_z1684hemm_static [ 99%] Linking CXX static library libcutlass.a [ 99%] Built target cutlass_library_static [ 99%] Built target cutlass_library [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/main.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cutlass_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/options.cu.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/performance_report.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/enumerated_types.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gpu_timer.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_allocation.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/device_context.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cublas_helpers.cu.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/cudnn_helpers.cpp.o [ 99%] Building CXX object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/problem_space.cpp.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/gemm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/rank_k_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/rank_2k_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/trmm_operation_profiler.cu.o [ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/symm_operation_profiler.cu.o [ 
99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv2d_operation_profiler.cu.o
[ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/conv3d_operation_profiler.cu.o
[ 99%] Building CUDA object tools/profiler/CMakeFiles/cutlass_profiler.dir/src/sparse_gemm_operation_profiler.cu.o
[100%] Linking CXX executable cutlass_profiler
[100%] Built target cutlass_profiler
+ popd
+ exit 0
~/build/BUILD/cutlass
Executing(%install): /bin/sh -e /var/tmp/rpm-tmp.ftbxQO
+ umask 022
+ cd /builddir/build/BUILD
+ '[' /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le '!=' / ']'
+ rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le
++ dirname /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le
+ mkdir -p /builddir/build/BUILDROOT
+ mkdir /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le
+ cd cutlass
+ rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le
+ pushd build
~/build/BUILD/cutlass/build ~/build/BUILD/cutlass
+ DESTDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le
+ /usr/bin/cmake --install .
-- Install configuration: "Release" -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/axpby.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/clear.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/cooperative_copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/cooperative_gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/fill.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/functional.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/prefer.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/prefetch.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/tensor_algorithms.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/algorithm/tuple_algorithms.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/cluster_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/copy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/copy_sm50.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/copy_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/copy_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/copy_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/copy_sm90_desc.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/copy_sm90_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma_sm61.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma_sm70.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma_sm90_desc.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/mma_sm90_gmma.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/arch/util.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_atom.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits_sm50.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits_sm75.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits_sm90_im2col.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits_sm90_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_atom.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_traits.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_traits_sm61.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_traits_sm70.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_traits_sm75.hpp -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_traits_sm80.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_traits_sm90.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/atom/mma_traits_sm90_gmma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/config.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/alignment.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/array.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/array_aligned.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/array_subbyte.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/bit_field.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/cuda_types.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/container/type_list.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/int_tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/layout_composed.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/arithmetic_tuple.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/complex.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/int.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/integer_sequence.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/integral_constant.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/integral_ratio.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/math.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/numeric_types.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/numeric/real.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/pointer.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/pointer_base.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/pointer_flagged.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/pointer_swizzle.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/stride.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/swizzle.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/swizzle_layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/tensor.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/tensor_predicate.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/underscore.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/util -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/util/debug.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/util/print.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cute/util/type_traits.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/aligned_buffer.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/arch.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/barrier.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/cache_operation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/memory.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/memory_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/memory_sm80.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm50.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm89.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sm90.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sparse_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/mma_sparse_sm89.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/reg_reconfig.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/simd.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/simd_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/simd_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/wmma.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/wmma_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/wmma_sm72.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/arch/wmma_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/array_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/array_subbyte.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/barrier.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/bfloat16.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/blas3_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/block_striped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/cluster_launch.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/constants.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective/builders -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective/builders/sm90_common.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective/builders/sm90_gmma_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective/collective_builder.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective/collective_conv.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/conv2d_problem_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/conv3d_problem_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/convnd_problem_shape.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/device/conv_universal_adapter.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/device/direct_convolution.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/device/implicit_gemm_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/dispatch_policy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/conv_universal.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_fprop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_fprop_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_group_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_wgrad.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv2d_wgrad_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv3d_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv3d_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv3d_fprop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv3d_fprop_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_conv3d_wgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_deconv2d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_deconv2d_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_deconv3d.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_deconv3d_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/default_depthwise_fprop.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/direct_convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/implicit_gemm_convolution.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_strided_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/implicit_gemm_convolution_with_fused_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/thread/depthwise_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_dgrad_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_dgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_few_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_fixed_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_few_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_fixed_channels.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_fprop_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_analytic.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_wgrad_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv2d_wgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_dgrad_filter_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_dgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_fprop_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_fprop_filter_tile_access_iterator_optimized.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_wgrad_activation_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_analytic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/conv3d_wgrad_output_gradient_tile_access_iterator_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_direct_conv_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_fixed_stride_dilation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_fprop_activation_tile_access_iterator_direct_conv_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_fprop_direct_conv_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_fprop_filter_tile_access_iterator_direct_conv_optimized.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_fprop_pipelined.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_mma_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/depthwise_mma_core_with_lane_access_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/implicit_gemm_fprop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/implicit_gemm_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/implicit_gemm_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/implicit_gemm_wgrad_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/predicated_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/predicated_scale_bias_vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/threadblock/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/warp/mma_depthwise_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/warp/mma_depthwise_simt_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/conv/warp/scale_bias_relu_transform.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/core_io.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/cuda_host_adapter.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/cutlass.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/detail -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/detail/collective.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/detail/dependent_false.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/detail/helper_macros.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/detail/layout.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/detail/mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/device_kernel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/builders -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/builders/sm90_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/collective_builder.hpp -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/collective_epilogue.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/default_epilogue.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/default_epilogue_array.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/epilogue_tensor_broadcast.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/dispatch_policy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion/callbacks.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion/operations.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/activation.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/conversion_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/detail.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_bias_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_clamp.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_dgelu.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_drelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_gelu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_generic.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_generic_with_scaling.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_hardswish.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_leaky_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_relu0.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_residual_block.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_sigmoid.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_silu.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_tensor_broadcast.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/linear_combination_with_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/reduction_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/thread/scale_type.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_complex_tensor_op_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_direct_store.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_volta_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_epilogue_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_thread_map_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/direct_store_epilogue_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_base_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_depthwise.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_direct_store.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_gemm_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_smem_accumulator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_visitor_with_softmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_with_visitor_callbacks.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/epilogue_workspace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/fusion -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/fusion/visitor_2x.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/fusion/visitor_compute.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/fusion/visitor_load.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/fusion/visitor_store.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/fusion/visitors.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/interleaved_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/output_iterator_parameter.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/output_tile_thread_map.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_affine_layout_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_blas3.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_conv.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_direct_conv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_predicates.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/predicated_tile_iterator_strided_dgrad.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/shared_load_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/shared_load_iterator_mixed.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/threadblock/shared_load_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/fragment_iterator_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/fragment_iterator_gaussian_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/fragment_iterator_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/fragment_iterator_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/fragment_iterator_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/fragment_iterator_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/simt_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/tile_iterator_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/tile_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/tile_iterator_tensor_op_mixed.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/tile_iterator_volta_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/tile_iterator_wmma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/volta_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/epilogue/warp/wmma_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/fast_math.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/float8.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/floating_point_nvrtc.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/builders -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/builders/sm90_common.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/collective_builder.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/collective_mma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/fp8_accumulation.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm70_mma_twostage.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm80_mma_multistage.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/base_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/default_gemm_configuration.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_batched.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_grouped.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_sparse_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_sparse_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_splitk_parallel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_universal_adapter.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_universal_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_universal_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_universal_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_universal_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/rank_2k.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/rank_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/device/trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/dispatch_policy.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/gemm_enumerated_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/group_array_problem_shape.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_grouped_softmax_mainloop_fusion.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_planar_complex_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_sparse_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_sparse_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_splitk_parallel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_streamk_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_universal_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_with_broadcast.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemm_with_reduction.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_rank_2k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_rank_2k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_rank_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_rank_k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_rank_k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_symm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_symm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/default_trmm_universal.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/ell_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_batched.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_grouped_softmax_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_planar_complex_array.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_splitk_parallel.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_streamk_with_fused_epilogue.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_transpose_operands.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_universal.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_universal_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_universal_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_universal_with_visitor_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_with_fused_epilogue.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemm_with_k_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/gemv_batched_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/params_sparse_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/params_universal_base.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/rank_2k_grouped.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/rank_2k_grouped_problem_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/rank_2k_transpose_operands.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/rank_2k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/rank_k_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm70_gemm.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_cooperative.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_tma.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_pingpong.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_cooperative.hpp -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_gemm_warpspecialized_pingpong.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler_group.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sm90_tile_scheduler_stream_k.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sparse_gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sparse_gemm_with_absmax.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/sparse_gemm_with_visitor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/static_tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/symm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/tile_scheduler.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/tile_scheduler_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/kernel/trmm_universal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/thread/mma.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/thread/mma_sm50.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/thread/mma_sm60.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/thread/mma_sm61.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_ell_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_gemv_core.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_sparse_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_with_access_size.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_core_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_layernorm_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_planar_complex_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_planar_complex_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_softmax_mainloop_fusion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_mma_with_reduction.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_multistage_mma_complex_core_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_multistage_trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_sparse_mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/default_trmm.h 
-- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/ell_mma_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/ell_mma_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/gemv.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/index_remat.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_blas3_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_layernorm_mainloop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_planar_complex_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_planar_complex_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_planar_complex_pipelined.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_singlestage.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_softmax_mainloop_fusion_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_sparse_base.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_sparse_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/mma_with_reduction_multistage.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/threadblock/threadblock_swizzle_streamk.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/default_mma_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/default_mma_sparse_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/default_mma_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/default_mma_tensor_op_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/default_mma_with_reduction_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/default_mma_wmma_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/layernorm_scale_bias_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_complex_tensor_op_fast_f32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_complex_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_gaussian_complex_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_mixed_input_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_simt.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_simt_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_simt_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_sparse_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_fast_f32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_fragment_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_policy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_sparse.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_tile_iterator_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_tensor_op_wmma.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/mma_with_reduction_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/scale_bias_tile_iterator.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/softmax_scale_bias_transform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm/warp/tile_iterator_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/gemm_coord.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/half.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/integer_subbyte.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/kernel_hardware_info.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/kernel_hardware_info.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/kernel_launch.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/layout.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/permute.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/tensor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/tensor_op_multiplicand_sm70.h -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/tensor_op_multiplicand_sm75.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/tensor_op_multiplicand_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/layout/vector.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/matrix_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/matrix_shape.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/numeric_conversion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/numeric_size.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/numeric_types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/pipeline -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/pipeline/pipeline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/pipeline/sm90_pipeline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/pitch_linear_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/platform -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/platform/platform.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/predicate_vector.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/quaternion.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/real.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/device/reduce_split_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/device/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/device/tensor_reduce_affine_contiguous.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/device/tensor_reduce_affine_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/kernel/reduce_softmax_final.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/kernel/reduce_split_k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/kernel/tensor_reduce_affine_contiguous.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/kernel/tensor_reduce_affine_strided.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/thread -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/thread/reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/thread/reduction_operators.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/reduction/threadblock_swizzle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/relatively_equal.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/semaphore.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/subbyte_reference.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/tensor_coord.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/tensor_ref.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/tensor_ref_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/tensor_view.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/tensor_view_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/tfloat32.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/thread/matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/trace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/collective -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/collective/sm90_wgmma_transpose.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/pitch_linear_thread_map.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/thread/transpose.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/thread/unary_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/ell_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/ell_predicated_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/ell_predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_scale_bias_vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_params.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_tile_access_iterator_triangular_matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_tile_iterator_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_tile_iterator_triangular_matrix.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/predicated_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_scale_bias_vector_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_pitch_linear_direct_conv.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_access_iterator_tensor_op_sm80.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_iterator_pitch_linear_2dthreadtile.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/regular_tile_iterator_tensor_op_sm70.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/threadblock/vector_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/warp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/transform/warp/vector_fragment_iterator.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/uint128.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/version.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/wmma_array.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/workspace.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/functional.h.fp16~ -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/functional.h -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/cutlass/version_extended.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/test/cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/test/cutlass/bin -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/test/cutlass/lib64 -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/test/cutlass/ctest -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/ -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/GPU_Clock.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/command_line.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/cublas_wrappers.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/debug.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_dump.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_groupnorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_layernorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_memory.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_nchw_to_nhwc.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_nhwc_padding.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_nhwc_pooling.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_nhwc_to_nchw.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_rmsnorm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/device_utils.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/distribution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/exceptions.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/gett_commandline.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/helper_cuda.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/host_reorder.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/host_tensor.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/host_tensor_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/host_uncompress.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/index_sequence.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/packed_stride.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/print_error.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/detail -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/detail/inner_product.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/detail/linear_to_coordinate.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/gemm_complex.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/gett.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/kernel -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/kernel/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/kernel/tensor_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/kernel/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/tensor_compare.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/tensor_fill.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/tensor_relu.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/thread -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/device/thread/gemm.h 
-- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/conv.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/convolution.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/error_metrics.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/gemm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/gemm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/gemm_planar_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/gett.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/rank_2k.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/rank_2k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/rank_k_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/symm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/symm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_compare.h -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_compare.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_copy.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_elementwise.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_fill.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_fill.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_foreach.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_norm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_reduce.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/tensor_reduce.hpp -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/trmm.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/reference/host/trmm_complex.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/tensor_view_io.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/util/type_traits.h -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include/ -- Up-to-date: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass 
-- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/arch_mappings.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/descriptions.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/handle.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/library.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/manifest.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/operation_table.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/singleton.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/types.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/include//cutlass/library/util.h -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_cgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_cgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_dgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_dgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_sgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_sgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm60_hgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm60_hgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm61_igemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm61_igemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_c1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_c1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_cgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_cgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_d884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_d884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_dgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_dgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_gz884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_gz884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.a -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_sgemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_sgemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_z884gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_z884gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_d1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_d1684gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_z1684gemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_z1684gemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_d884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_d884syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_gz884herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_gz884herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.a -- 
Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_z884herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_z884herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_z884syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_z884syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_z1684herk.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_z1684herk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_c1688trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_c1688trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_d884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_d884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_gz884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_gz884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_s1688trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_s1688trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_z884trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_z884trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_d1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_d1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_z1684trmm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_z1684trmm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.so -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_d884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_d884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_gz884hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_gz884hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_gz884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_gz884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_s1688symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_s1688symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_z884hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_z884hemm.a -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_z884symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_z884symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_d1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_d1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_gz1684hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_gz1684hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_gz1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_gz1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_z1684hemm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_z1684hemm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_z1684symm.so -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_z1684symm.a -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/info/cutlass/generated_kernels.txt -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/bin/cutlass_profiler -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/test/cutlass/ctest/ctest_profiler/CTestTestfile.ctest_profiler.cmake -- Installing: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/test/cutlass/CTestTestfile.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassConfig.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassConfigVersion.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassTargets.cmake -- Installing: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/cmake/NvidiaCutlass/NvidiaCutlassTargets-release.cmake ~/build/BUILD/cutlass + popd + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/test + rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/info + set +x Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/bin/cutlass_profiler Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_sdgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_sfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm50_swgrad_optimized.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm60_hfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_h884wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_few_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_h1688wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_h16816wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_sdgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_sfprop_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_swgrad_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816fprop3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_cgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_dgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm50_sgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm60_hgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm61_igemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm61_s8_igemm_s8.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_h884gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i88128xorgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8816gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8816gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8832gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_i8832gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s4_i8832gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_s8_i8816gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_u4_i8832gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm75_u8_i8816gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_c1688gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_c1688tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_cgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_d884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_dgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_f16_s16832spgemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_gz884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_grouped.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_h16832spgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168128spgemm_s4.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168256andgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i168256xorgemm_b1.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16832gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16832gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_i16864spgemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_bf16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_f16_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_grouped_f16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_s8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816gemm_u8_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16816tf32spgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s16832spgemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688bf16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688f16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688gemm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s1688tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s4_i168128spgemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s4_i16864gemm_s4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s8_i16832gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_s8_i16864spgemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_sgemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_u4_i16864gemm_u4.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_u8_i16832gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm80_z884gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_d1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_gz1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_h64x128x16gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_bf16.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x8gemm_tf32.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s64x128x8tf32gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_gemm_sm90_z1684gemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_c1688tf32syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_d884syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_gz884her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_gz884syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_s1688syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_s1688tf32syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_z884her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm80_z884syr2k.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_d1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_gz1684her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_gz1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_z1684her2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_2k_sm90_z1684syr2k.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688tf32herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_c1688tf32syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_d884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_gz884herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_gz884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_s1688syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_s1688tf32syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_z884herk.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm80_z884syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_d1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_gz1684herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_gz1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_z1684herk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_rank_k_sm90_z1684syrk.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688tf32hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_c1688tf32symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_d884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_gz884hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_gz884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_s1688symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_s1688tf32symm.so Stripping: 
/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_z884hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm80_z884symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_d1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_gz1684hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_gz1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_z1684hemm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_symm_sm90_z1684symm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_c1688tf32trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_c1688trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_d884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_gz884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_s1688tf32trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_s1688trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm80_z884trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_d1684trmm.so Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_gz1684trmm.so 
Stripping: /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/lib64/libcutlass_trmm_sm90_z1684trmm.so
+ /usr/lib/rpm/check-buildroot
+ /usr/lib/rpm/redhat/brp-ldconfig
/sbin/ldconfig: Warning: ignoring configuration file that cannot be opened: /etc/ld.so.conf: No such file or directory
+ /usr/lib/rpm/brp-compress
+ /usr/lib/rpm/brp-strip /usr/bin/strip
+ /usr/lib/rpm/brp-strip-comment-note /usr/bin/strip /usr/bin/objdump
+ /usr/lib/rpm/brp-strip-static-archive /usr/bin/strip
+ /usr/lib/rpm/brp-python-bytecompile '' 1
+ /usr/lib/rpm/brp-python-hardlink
+ PYTHON3=/usr/bin/python3.6
+ /usr/lib/rpm/redhat/brp-mangle-shebangs
Processing files: cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le
Executing(%doc): /bin/sh -e /var/tmp/rpm-tmp.1rwEbj
+ umask 022
+ cd /builddir/build/BUILD
+ cd cutlass
+ DOCDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/doc/cutlass
+ export LC_ALL=C
+ LC_ALL=C
+ export DOCDIR
+ /usr/bin/mkdir -p /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/doc/cutlass
+ cp -pr README.md /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/doc/cutlass
+ cp -pr docs /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/doc/cutlass
+ exit 0
Executing(%license): /bin/sh -e /var/tmp/rpm-tmp.LOTuwZ
+ umask 022
+ cd /builddir/build/BUILD
+ cd cutlass
+ LICENSEDIR=/builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/licenses/cutlass
+ export LC_ALL=C
+ LC_ALL=C
+ export LICENSEDIR
+ /usr/bin/mkdir -p /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/licenses/cutlass
+ cp -pr LICENSE.txt /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le/usr/share/licenses/cutlass
+ exit 0
Provides: cutlass = 3.5.0-20240411.1.cu12_4.el8 cutlass(ppc-64) = 3.5.0-20240411.1.cu12_4.el8 libcutlass.so()(64bit) libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so()(64bit)
libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm50_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm50_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm60_hfprop_optimized.so()(64bit) libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_h884dgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_h884fprop_optimized.so()(64bit) libcutlass_conv2d_sm70_h884wgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_h1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_few_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so()(64bit) 
libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_h16816dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so()(64bit) 
libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm80_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so()(64bit) 
libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so()(64bit) libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816fprop3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so()(64bit) 
libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_gemm_sm50_cgemm.so()(64bit) libcutlass_gemm_sm50_dgemm.so()(64bit) libcutlass_gemm_sm50_sgemm.so()(64bit) libcutlass_gemm_sm60_hgemm.so()(64bit) libcutlass_gemm_sm61_igemm_s8.so()(64bit) libcutlass_gemm_sm61_s8_igemm_s8.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm70_h884gemm.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm70_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_h1688gemm.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm75_i88128xorgemm_b1.so()(64bit) libcutlass_gemm_sm75_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm75_i8832gemm_s4.so()(64bit) 
libcutlass_gemm_sm75_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_s4_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_s8_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_u4_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_u8_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_c1688gemm.so()(64bit) libcutlass_gemm_sm80_c1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_cgemm.so()(64bit) libcutlass_gemm_sm80_d884gemm.so()(64bit) libcutlass_gemm_sm80_dgemm.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_gz884gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm_grouped.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm80_h16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_h16832spgemm.so()(64bit) libcutlass_gemm_sm80_i168128spgemm_s4.so()(64bit) 
libcutlass_gemm_sm80_i168256andgemm_b1.so()(64bit) libcutlass_gemm_sm80_i168256xorgemm_b1.so()(64bit) libcutlass_gemm_sm80_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_s16816tf32spgemm.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_s1688bf16gemm.so()(64bit) libcutlass_gemm_sm80_s1688f16gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_s1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_s4_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_s4_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_s8_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_s8_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_sgemm.so()(64bit) libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_u4_i16864gemm_u4.so()(64bit) 
libcutlass_gemm_sm80_u8_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_z884gemm.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_d1684gemm.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_gz1684gemm.so()(64bit) libcutlass_gemm_sm90_h64x128x16gemm.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so()(64bit) 
libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x8gemm_tf32.so()(64bit) libcutlass_gemm_sm90_s64x128x8tf32gemm.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_z1684gemm.so()(64bit) libcutlass_rank_2k_sm80_c1688her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_d884syr2k.so()(64bit) libcutlass_rank_2k_sm80_gz884her2k.so()(64bit) libcutlass_rank_2k_sm80_gz884syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_z884her2k.so()(64bit) libcutlass_rank_2k_sm80_z884syr2k.so()(64bit) libcutlass_rank_2k_sm90_d1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684her2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_z1684her2k.so()(64bit) libcutlass_rank_2k_sm90_z1684syr2k.so()(64bit) libcutlass_rank_k_sm80_c1688herk.so()(64bit) libcutlass_rank_k_sm80_c1688syrk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32herk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_d884syrk.so()(64bit) libcutlass_rank_k_sm80_gz884herk.so()(64bit) libcutlass_rank_k_sm80_gz884syrk.so()(64bit) 
libcutlass_rank_k_sm80_s1688syrk.so()(64bit) libcutlass_rank_k_sm80_s1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_z884herk.so()(64bit) libcutlass_rank_k_sm80_z884syrk.so()(64bit) libcutlass_rank_k_sm90_d1684syrk.so()(64bit) libcutlass_rank_k_sm90_gz1684herk.so()(64bit) libcutlass_rank_k_sm90_gz1684syrk.so()(64bit) libcutlass_rank_k_sm90_z1684herk.so()(64bit) libcutlass_rank_k_sm90_z1684syrk.so()(64bit) libcutlass_symm_sm80_c1688hemm.so()(64bit) libcutlass_symm_sm80_c1688symm.so()(64bit) libcutlass_symm_sm80_c1688tf32hemm.so()(64bit) libcutlass_symm_sm80_c1688tf32symm.so()(64bit) libcutlass_symm_sm80_d884symm.so()(64bit) libcutlass_symm_sm80_gz884hemm.so()(64bit) libcutlass_symm_sm80_gz884symm.so()(64bit) libcutlass_symm_sm80_s1688symm.so()(64bit) libcutlass_symm_sm80_s1688tf32symm.so()(64bit) libcutlass_symm_sm80_z884hemm.so()(64bit) libcutlass_symm_sm80_z884symm.so()(64bit) libcutlass_symm_sm90_d1684symm.so()(64bit) libcutlass_symm_sm90_gz1684hemm.so()(64bit) libcutlass_symm_sm90_gz1684symm.so()(64bit) libcutlass_symm_sm90_z1684hemm.so()(64bit) libcutlass_symm_sm90_z1684symm.so()(64bit) libcutlass_trmm_sm80_c1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_c1688trmm.so()(64bit) libcutlass_trmm_sm80_d884trmm.so()(64bit) libcutlass_trmm_sm80_gz884trmm.so()(64bit) libcutlass_trmm_sm80_s1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_s1688trmm.so()(64bit) libcutlass_trmm_sm80_z884trmm.so()(64bit) libcutlass_trmm_sm90_d1684trmm.so()(64bit) libcutlass_trmm_sm90_gz1684trmm.so()(64bit) libcutlass_trmm_sm90_z1684trmm.so()(64bit) Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 Requires: ld64.so.2()(64bit) ld64.so.2(GLIBC_2.22)(64bit) libc.so.6()(64bit) libc.so.6(GLIBC_2.17)(64bit) libcuda.so.1()(64bit) libcudart.so.12()(64bit) libcudart.so.12(libcudart.so.12)(64bit) libcutlass.so()(64bit) libcutlass_conv2d_sm50_cf32_cdgrad_optimized_cf32.so()(64bit) 
libcutlass_conv2d_sm50_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm50_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm50_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm50_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm60_hfprop_optimized.so()(64bit) libcutlass_conv2d_sm70_f16_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_f16_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_h884dgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_h884fprop_optimized.so()(64bit) libcutlass_conv2d_sm70_h884wgrad_optimized.so()(64bit) libcutlass_conv2d_sm70_s884dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm70_s884wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_cf32_cdgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cfprop_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_cf32_cwgrad_optimized_cf32.so()(64bit) libcutlass_conv2d_sm75_f16_s1688dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_f16_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_h1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_few_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm75_h1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm75_h1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_s1688dgrad_optimized_f16.so()(64bit) 
libcutlass_conv2d_sm75_s1688fprop_few_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm75_s1688fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s1688wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm75_s4_i8832fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm75_s8_i8816fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm75_u4_i8832fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_few_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm75_u8_i8816fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_fixed_channels_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_bf16_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_f16_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_h16816dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_fixed_channels.so()(64bit) libcutlass_conv2d_sm80_h16816fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_h16816wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816dgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_fixed_channels_bf16.so()(64bit) 
libcutlass_conv2d_sm80_s16816fprop_fixed_channels_f16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816fprop_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_bf16.so()(64bit) libcutlass_conv2d_sm80_s16816wgrad_optimized_f16.so()(64bit) libcutlass_conv2d_sm80_s1688bf16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688bf16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688f16dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688f16wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s1688tf32dgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32fprop_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688tf32wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_s4_i16864fprop_optimized_s4.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_few_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_fixed_channels_s8.so()(64bit) libcutlass_conv2d_sm80_s8_i16832fprop_optimized_s8.so()(64bit) libcutlass_conv2d_sm80_sdgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_sfprop_optimized.so()(64bit) libcutlass_conv2d_sm80_swgrad_optimized.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688dgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688fprop_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_tf32_s1688wgrad_optimized_tf32.so()(64bit) libcutlass_conv2d_sm80_u4_i16864fprop_optimized_u4.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_few_channels_u8.so()(64bit) 
libcutlass_conv2d_sm80_u8_i16832fprop_fixed_channels_u8.so()(64bit) libcutlass_conv2d_sm80_u8_i16832fprop_optimized_u8.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_fixed_channels_e5m2.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e4m3.so()(64bit) libcutlass_conv2d_sm89_s16832fprop_optimized_e5m2.so()(64bit) libcutlass_conv2d_sm90_h64x64x16dgrad_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_h64x64x16fprop_f16nhwc_f16nhwc_f16_f16_f16.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16dgrad_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_bf16nhwc_bf16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv2d_sm90_s64x64x16fprop_f16nhwc_f16nhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816fprop3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_bf16_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_f16_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_analytic.so()(64bit) libcutlass_conv3d_sm80_h16816dgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816fprop3d_optimized.so()(64bit) libcutlass_conv3d_sm80_h16816wgrad3d_optimized.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_analytic_f16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816dgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816fprop3d_optimized_bf16.so()(64bit) 
libcutlass_conv3d_sm80_s16816fprop3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_bf16.so()(64bit) libcutlass_conv3d_sm80_s16816wgrad3d_optimized_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16dgrad_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_h64x64x16fprop_f16ndhwc_f16ndhwc_f16_f16_f16.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16dgrad_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_bf16ndhwc_bf16ndhwc_f32_f32_f32.so()(64bit) libcutlass_conv3d_sm90_s64x64x16fprop_f16ndhwc_f16ndhwc_f32_f32_f32.so()(64bit) libcutlass_gemm_sm50_cgemm.so()(64bit) libcutlass_gemm_sm50_dgemm.so()(64bit) libcutlass_gemm_sm50_sgemm.so()(64bit) libcutlass_gemm_sm60_hgemm.so()(64bit) libcutlass_gemm_sm61_igemm_s8.so()(64bit) libcutlass_gemm_sm61_s8_igemm_s8.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_f16_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm70_h884gemm.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex.so()(64bit) libcutlass_gemm_sm70_h884gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm70_s884gemm_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm70_s884gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_f16_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_h1688gemm.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex.so()(64bit) libcutlass_gemm_sm75_h1688gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm75_i88128xorgemm_b1.so()(64bit) libcutlass_gemm_sm75_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm75_i8832gemm_s4.so()(64bit) 
libcutlass_gemm_sm75_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_s1688gemm_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm75_s1688gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm75_s4_i8832gemm_s4.so()(64bit) libcutlass_gemm_sm75_s8_i8816gemm_s8.so()(64bit) libcutlass_gemm_sm75_u4_i8832gemm_u4.so()(64bit) libcutlass_gemm_sm75_u8_i8816gemm_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_bf16_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_c1688gemm.so()(64bit) libcutlass_gemm_sm80_c1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_cgemm.so()(64bit) libcutlass_gemm_sm80_d884gemm.so()(64bit) libcutlass_gemm_sm80_dgemm.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_f16_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_gz884gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm.so()(64bit) libcutlass_gemm_sm80_h16816gemm_grouped.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex.so()(64bit) libcutlass_gemm_sm80_h16816gemm_planar_complex_array.so()(64bit) libcutlass_gemm_sm80_h16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_h16832spgemm.so()(64bit) libcutlass_gemm_sm80_i168128spgemm_s4.so()(64bit) 
libcutlass_gemm_sm80_i168256andgemm_b1.so()(64bit) libcutlass_gemm_sm80_i168256xorgemm_b1.so()(64bit) libcutlass_gemm_sm80_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_i16864gemm_u4.so()(64bit) libcutlass_gemm_sm80_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_bf16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_s8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_f16_u8.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_grouped_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_array_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_planar_complex_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_s8_f16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_bf16.so()(64bit) libcutlass_gemm_sm80_s16816gemm_u8_f16.so()(64bit) libcutlass_gemm_sm80_s16816tf32spgemm.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_bf16.so()(64bit) libcutlass_gemm_sm80_s16832spgemm_f16.so()(64bit) libcutlass_gemm_sm80_s1688bf16gemm.so()(64bit) libcutlass_gemm_sm80_s1688f16gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm.so()(64bit) libcutlass_gemm_sm80_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_s1688tf32gemm.so()(64bit) libcutlass_gemm_sm80_s4_i168128spgemm_s4.so()(64bit) libcutlass_gemm_sm80_s4_i16864gemm_s4.so()(64bit) libcutlass_gemm_sm80_s8_i16832gemm_s8.so()(64bit) libcutlass_gemm_sm80_s8_i16864spgemm_s8.so()(64bit) libcutlass_gemm_sm80_sgemm.so()(64bit) libcutlass_gemm_sm80_tf32_s1688gemm_tf32.so()(64bit) libcutlass_gemm_sm80_u4_i16864gemm_u4.so()(64bit) 
libcutlass_gemm_sm80_u8_i16832gemm_u8.so()(64bit) libcutlass_gemm_sm80_z884gemm.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832fastaccumgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16832gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864fastaccumspgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2.so()(64bit) libcutlass_gemm_sm89_s16864spgemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_bf16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_d1684gemm.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_f16_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_gz1684gemm.so()(64bit) libcutlass_gemm_sm90_h64x128x16gemm.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e4m3.so()(64bit) 
libcutlass_gemm_sm90_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_s64x128x8gemm_tf32.so()(64bit) libcutlass_gemm_sm90_s64x128x8tf32gemm.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_s8_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_s8.so()(64bit) libcutlass_gemm_sm90_void_i64x128x32gemm_u8.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_bf16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x16gemm_f16.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e4m3_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2.so()(64bit) libcutlass_gemm_sm90_void_s64x128x32gemm_e5m2_e4m3.so()(64bit) libcutlass_gemm_sm90_z1684gemm.so()(64bit) libcutlass_rank_2k_sm80_c1688her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32her2k.so()(64bit) libcutlass_rank_2k_sm80_c1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_d884syr2k.so()(64bit) libcutlass_rank_2k_sm80_gz884her2k.so()(64bit) libcutlass_rank_2k_sm80_gz884syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688syr2k.so()(64bit) libcutlass_rank_2k_sm80_s1688tf32syr2k.so()(64bit) libcutlass_rank_2k_sm80_z884her2k.so()(64bit) libcutlass_rank_2k_sm80_z884syr2k.so()(64bit) libcutlass_rank_2k_sm90_d1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684her2k.so()(64bit) libcutlass_rank_2k_sm90_gz1684syr2k.so()(64bit) libcutlass_rank_2k_sm90_z1684her2k.so()(64bit) libcutlass_rank_2k_sm90_z1684syr2k.so()(64bit) libcutlass_rank_k_sm80_c1688herk.so()(64bit) libcutlass_rank_k_sm80_c1688syrk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32herk.so()(64bit) libcutlass_rank_k_sm80_c1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_d884syrk.so()(64bit) libcutlass_rank_k_sm80_gz884herk.so()(64bit) libcutlass_rank_k_sm80_gz884syrk.so()(64bit) 
libcutlass_rank_k_sm80_s1688syrk.so()(64bit) libcutlass_rank_k_sm80_s1688tf32syrk.so()(64bit) libcutlass_rank_k_sm80_z884herk.so()(64bit) libcutlass_rank_k_sm80_z884syrk.so()(64bit) libcutlass_rank_k_sm90_d1684syrk.so()(64bit) libcutlass_rank_k_sm90_gz1684herk.so()(64bit) libcutlass_rank_k_sm90_gz1684syrk.so()(64bit) libcutlass_rank_k_sm90_z1684herk.so()(64bit) libcutlass_rank_k_sm90_z1684syrk.so()(64bit) libcutlass_symm_sm80_c1688hemm.so()(64bit) libcutlass_symm_sm80_c1688symm.so()(64bit) libcutlass_symm_sm80_c1688tf32hemm.so()(64bit) libcutlass_symm_sm80_c1688tf32symm.so()(64bit) libcutlass_symm_sm80_d884symm.so()(64bit) libcutlass_symm_sm80_gz884hemm.so()(64bit) libcutlass_symm_sm80_gz884symm.so()(64bit) libcutlass_symm_sm80_s1688symm.so()(64bit) libcutlass_symm_sm80_s1688tf32symm.so()(64bit) libcutlass_symm_sm80_z884hemm.so()(64bit) libcutlass_symm_sm80_z884symm.so()(64bit) libcutlass_symm_sm90_d1684symm.so()(64bit) libcutlass_symm_sm90_gz1684hemm.so()(64bit) libcutlass_symm_sm90_gz1684symm.so()(64bit) libcutlass_symm_sm90_z1684hemm.so()(64bit) libcutlass_symm_sm90_z1684symm.so()(64bit) libcutlass_trmm_sm80_c1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_c1688trmm.so()(64bit) libcutlass_trmm_sm80_d884trmm.so()(64bit) libcutlass_trmm_sm80_gz884trmm.so()(64bit) libcutlass_trmm_sm80_s1688tf32trmm.so()(64bit) libcutlass_trmm_sm80_s1688trmm.so()(64bit) libcutlass_trmm_sm80_z884trmm.so()(64bit) libcutlass_trmm_sm90_d1684trmm.so()(64bit) libcutlass_trmm_sm90_gz1684trmm.so()(64bit) libcutlass_trmm_sm90_z1684trmm.so()(64bit) libgcc_s.so.1()(64bit) libgcc_s.so.1(GCC_3.0)(64bit) libgcc_s.so.1(GCC_3.4.4)(64bit) libm.so.6()(64bit) libm.so.6(GLIBC_2.17)(64bit) libstdc++.so.6()(64bit) libstdc++.so.6(CXXABI_1.3)(64bit) libstdc++.so.6(CXXABI_1.3.5)(64bit) libstdc++.so.6(CXXABI_1.3.9)(64bit) libstdc++.so.6(GLIBCXX_3.4)(64bit) libstdc++.so.6(GLIBCXX_3.4.11)(64bit) libstdc++.so.6(GLIBCXX_3.4.15)(64bit) libstdc++.so.6(GLIBCXX_3.4.18)(64bit) libstdc++.so.6(GLIBCXX_3.4.20)(64bit) 
libstdc++.so.6(GLIBCXX_3.4.21)(64bit) libstdc++.so.6(GLIBCXX_3.4.5)(64bit) libstdc++.so.6(GLIBCXX_3.4.9)(64bit) rtld(GNU_HASH) Processing files: cutlass-devel-3.5.0-20240411.1.cu12_4.el8.ppc64le Provides: cmake(NvidiaCutlass) = 3.5.0 cmake(nvidiacutlass) = 3.5.0 cutlass-devel = 3.5.0-20240411.1.cu12_4.el8 cutlass-devel(ppc-64) = 3.5.0-20240411.1.cu12_4.el8 Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 Requires: cmake-filesystem(ppc-64) Processing files: cutlass-static-3.5.0-20240411.1.cu12_4.el8.ppc64le Provides: cutlass-static = 3.5.0-20240411.1.cu12_4.el8 cutlass-static(ppc-64) = 3.5.0-20240411.1.cu12_4.el8 Requires(rpmlib): rpmlib(CompressedFileNames) <= 3.0.4-1 rpmlib(FileDigests) <= 4.6.0-1 rpmlib(PayloadFilesHavePrefix) <= 4.0-1 Checking for unpackaged file(s): /usr/lib/rpm/check-files /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le Wrote: /builddir/build/RPMS/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le.rpm Wrote: /builddir/build/RPMS/cutlass-devel-3.5.0-20240411.1.cu12_4.el8.ppc64le.rpm Wrote: /builddir/build/RPMS/cutlass-static-3.5.0-20240411.1.cu12_4.el8.ppc64le.rpm Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.McziEi + umask 022 + cd /builddir/build/BUILD + cd cutlass + /usr/bin/rm -rf /builddir/build/BUILDROOT/cutlass-3.5.0-20240411.1.cu12_4.el8.ppc64le + exit 0 Finish: rpmbuild cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm Finish: build phase for cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm INFO: chroot_scan: 3 files copied to /var/lib/copr-rpmbuild/results/chroot_scan INFO: /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root/var/log/dnf.log /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root/var/log/dnf.librepo.log /var/lib/mock/rhel+epel-8-ppc64le-1713469148.550083/root/var/log/dnf.rpm.log INFO: Done(/var/lib/copr-rpmbuild/results/cutlass-3.5.0-20240411.1.cu12_4.el8.src.rpm) Config(child) 917 minutes 17 seconds INFO: Results and/or logs in: 
/var/lib/copr-rpmbuild/results INFO: Cleaning up build root ('cleanup_on_success=True') Start: clean chroot INFO: unmounting tmpfs. Finish: clean chroot Finish: run Running RPMResults tool Package info: { "packages": [ { "name": "cutlass-static", "epoch": null, "version": "3.5.0", "release": "20240411.1.cu12_4.el8", "arch": "ppc64le" }, { "name": "cutlass-devel", "epoch": null, "version": "3.5.0", "release": "20240411.1.cu12_4.el8", "arch": "ppc64le" }, { "name": "cutlass", "epoch": null, "version": "3.5.0", "release": "20240411.1.cu12_4.el8", "arch": "src" }, { "name": "cutlass", "epoch": null, "version": "3.5.0", "release": "20240411.1.cu12_4.el8", "arch": "ppc64le" } ] } RPMResults finished